$R^2$-Tuning: Efficient Image-to-Video Transfer Learning for Video Temporal Grounding

Liu, Ye ; He, Jixuan ; Li, Wanhua ; Kim, Junsik ; Wei, Donglai ; Pfister, Hanspeter ; Chen, Chang Wen

Abstract

Video temporal grounding (VTG) is a fine-grained video understanding problem that aims to ground relevant clips in untrimmed videos given natural language queries. Most existing VTG models are built upon frame-wise final-layer CLIP features, aided by additional temporal backbones (e.g., SlowFast) with sophisticated temporal reasoning mechanisms. In this work, we claim that CLIP itself already shows great potential for fine-grained spatial-temporal modeling, as each layer offers distinct yet useful information under different granularity levels. Motivated by this, we propose Reversed Recurrent Tuning ($R^2$-Tuning), a parameter- and memory-efficient transfer learning framework for video temporal grounding. Our method learns a lightweight $R^2$ Block containing only 1.5% of the total parameters to perform progressive spatial-temporal modeling. Starting from the last layer of CLIP, $R^2$ Block recurrently aggregates spatial features from earlier layers, then refines temporal correlation conditioning on the given query, resulting in a coarse-to-fine scheme. $R^2$-Tuning achieves state-of-the-art performance across three VTG tasks (i.e., moment retrieval, highlight detection, and video summarization) on six public benchmarks (i.e., QVHighlights, Charades-STA, Ego4D-NLQ, TACoS, YouTube Highlights, and TVSum) even without the additional backbone, demonstrating the significance and effectiveness of the proposed scheme. Our code is available at https://github.com/yeliudev/R2-Tuning.
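The reversed recurrent pass described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation (see the linked repository for that): it has no learned parameters, and every name here (`spatial_pool`, `temporal_refine`, `reversed_recurrent_pass`), the mean pooling over patches, and the query-modulated attention are assumptions chosen only to show the control flow of visiting layers from last to first, aggregating spatial features, and then refining across frames with the query.

```python
import math
import random


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def softmax(xs):
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def spatial_pool(patches):
    """Assumed spatial aggregation: mean over one frame's patch vectors."""
    dim = len(patches[0])
    return [sum(p[d] for p in patches) / len(patches) for d in range(dim)]


def temporal_refine(frames, query):
    """Assumed temporal step: frames attend to each other, with attention
    weights computed on query-modulated features, plus a residual average."""
    scale = math.sqrt(len(query))
    keyed = [[a * b for a, b in zip(f, query)] for f in frames]
    refined = []
    for i, f in enumerate(frames):
        weights = softmax([dot(keyed[i], k) / scale for k in keyed])
        mixed = [sum(w * g[d] for w, g in zip(weights, frames))
                 for d in range(len(f))]
        refined.append([0.5 * (a + b) for a, b in zip(f, mixed)])
    return refined


def reversed_recurrent_pass(layer_feats, query):
    """layer_feats[l][t][p] is the feature vector of layer l, frame t, patch p.
    The same (parameter-free) step is reused for every layer, last layer first."""
    num_frames = len(layer_feats[0])
    hidden = [[0.0] * len(query) for _ in range(num_frames)]
    for feats in reversed(layer_feats):
        pooled = [spatial_pool(frame) for frame in feats]
        hidden = [[h + p for h, p in zip(hv, pv)]
                  for hv, pv in zip(hidden, pooled)]
        hidden = temporal_refine(hidden, query)
    return hidden


# Toy input: 3 layers, 4 frames, 5 patches per frame, 3-dimensional features.
rng = random.Random(0)
layer_feats = [[[[rng.gauss(0, 1) for _ in range(3)] for _ in range(5)]
                for _ in range(4)] for _ in range(3)]
query = [rng.gauss(0, 1) for _ in range(3)]

hidden = reversed_recurrent_pass(layer_feats, query)
scores = [dot(h, query) for h in hidden]  # one query-relevance score per frame
```

In this sketch the per-frame `scores` stand in for the kind of frame-level signal a grounding head could consume; a real model would replace the fixed pooling and attention with trained modules.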