
VideoCLIP: Contrastive Pre-training for Zero-shot Video-Text Understanding

Hu Xu, Gargi Ghosh, Po-Yao Huang, Dmytro Okhonko, Armen Aghajanyan, Florian Metze, Luke Zettlemoyer, Christoph Feichtenhofer
Abstract

We present VideoCLIP, a contrastive approach to pre-train a unified model for zero-shot video and text understanding, without using any labels on downstream tasks. VideoCLIP trains a transformer for video and text by contrasting temporally overlapping positive video-text pairs with hard negatives from nearest neighbor retrieval. Our experiments on a diverse series of downstream tasks, including sequence-level text-video retrieval, VideoQA, token-level action localization, and action segmentation, reveal state-of-the-art performance, surpassing prior work, and in some cases even outperforming supervised approaches. Code is made available at https://github.com/pytorch/fairseq/tree/main/examples/MMPT.
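The objective described in the abstract is a symmetric contrastive (InfoNCE-style) loss over a batch of temporally overlapping video-text clip pairs; the hard negatives come from building batches around nearest-neighbor retrieved videos, so that in-batch negatives are semantically close. Below is a minimal PyTorch sketch of such a symmetric contrastive loss. It illustrates the idea only; the function name `video_text_contrastive_loss` and the `temperature` default are illustrative assumptions, not the released MMPT implementation.

import torch
import torch.nn.functional as F

def video_text_contrastive_loss(video_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE over a batch of (video, text) clip pairs.

    Row i of `video_emb` and `text_emb` is assumed to be a positive
    (temporally overlapping) pair; every other row in the batch serves
    as a negative. Sketch only, not the paper's released code.
    """
    # Normalize so dot products are cosine similarities.
    v = F.normalize(video_emb, dim=-1)
    t = F.normalize(text_emb, dim=-1)
    logits = v @ t.T / temperature                      # (B, B) similarities
    targets = torch.arange(v.size(0), device=v.device)  # positives on diagonal
    # Contrast in both directions: video-to-text and text-to-video.
    loss_v2t = F.cross_entropy(logits, targets)
    loss_t2v = F.cross_entropy(logits.T, targets)
    return 0.5 * (loss_v2t + loss_t2v)

# Example: embeddings for a batch of 8 overlapping clip/caption pairs.
video_emb = torch.randn(8, 512)
text_emb = torch.randn(8, 512)
loss = video_text_contrastive_loss(video_emb, text_emb)

In the paper's retrieval-augmented scheme, the batch itself is drawn from a nearest-neighbor cluster of similar videos, which is what makes the in-batch negatives "hard" without changing the form of the loss above.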