
Abstract

Video representation learning has been successful in video-text pre-training for zero-shot transfer, where each sentence is trained to be close to the paired video clips in a common feature space. For long videos, given a paragraph of description where the sentences describe different segments of the video, by matching all sentence-clip pairs, the paragraph and the full video are aligned implicitly. However, such unit-level comparison may ignore global temporal context, which inevitably limits the generalization ability. In this paper, we propose a contrastive learning framework TempCLR to compare the full video and the paragraph explicitly. As the video/paragraph is formulated as a sequence of clips/sentences, under the constraint of their temporal order, we use dynamic time warping to compute the minimum cumulative cost over sentence-clip pairs as the sequence-level distance. To explore the temporal dynamics, we break the consistency of temporal succession by shuffling video clips w.r.t. temporal granularity. Then, we obtain the representations for clips/sentences, which perceive the temporal information and thus facilitate the sequence alignment. In addition to pre-training on the video and paragraph, our approach can also generalize on the matching between video instances. We evaluate our approach on video retrieval, action step localization, and few-shot action recognition, and achieve consistent performance gain over all three tasks. Detailed ablation studies are provided to justify the approach design.
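
As a rough illustration of the sequence-level distance described above, the following minimal sketch computes a dynamic time warping (DTW) cost between a sequence of sentence embeddings and a sequence of clip embeddings. It is not the authors' implementation: the function names `cosine_distance` and `dtw_distance`, the choice of cosine distance as the per-pair cost, and the toy 2-D embeddings are all assumptions made for this example.

```python
import math


def cosine_distance(u, v):
    """Per-pair cost: 1 - cosine similarity (an assumed choice of cost)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def dtw_distance(sentences, clips):
    """Minimum cumulative cost over order-preserving sentence-clip pairings."""
    n, m = len(sentences), len(clips)
    inf = float("inf")
    # acc[i][j] holds the best cumulative cost of aligning the first i
    # sentences with the first j clips.
    acc = [[inf] * (m + 1) for _ in range(n + 1)]
    acc[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = cosine_distance(sentences[i - 1], clips[j - 1])
            # Standard DTW recursion: extend the cheapest of the three
            # predecessor alignments, which keeps temporal order intact.
            acc[i][j] = cost + min(acc[i - 1][j], acc[i][j - 1], acc[i - 1][j - 1])
    return acc[n][m]


# Toy embeddings (hypothetical): two sentences, three clips.
sentences = [[1.0, 0.0], [0.0, 1.0]]
clips_ordered = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
# The same clips in a shuffled temporal order, used here as a negative.
clips_shuffled = [[0.0, 1.0], [1.0, 0.0], [1.0, 0.0]]

d_ordered = dtw_distance(sentences, clips_ordered)
d_shuffled = dtw_distance(sentences, clips_shuffled)
print(d_ordered, d_shuffled)
```

In this toy case the correctly ordered clip sequence aligns with the paragraph at a lower cumulative cost than the shuffled one, which is the kind of gap a contrastive objective over sequence-level distances could exploit.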
