Abstract

Recently, large-scale pre-trained vision-language models (e.g., CLIP) have garnered significant attention thanks to their powerful representation capabilities. This has inspired researchers to transfer the knowledge from these large pre-trained models to task-specific models, e.g., Video Action Recognition (VAR) models, in particular by leveraging side networks to improve the efficiency of parameter-efficient fine-tuning (PEFT). However, current transfer approaches in VAR tend to directly transfer the frozen knowledge from large pre-trained models to action recognition networks at minimal cost, instead of exploiting the temporal modeling capabilities of the action recognition models themselves. Therefore, in this paper, we propose a memory-efficient Temporal Difference Side Network (TDS-CLIP) that balances knowledge transfer and temporal modeling while avoiding backpropagation through the frozen-parameter model. Specifically, we introduce a Temporal Difference Adapter (TD-Adapter), which effectively captures local temporal differences in motion features to strengthen the model's global temporal modeling capabilities. Furthermore, we design a Side Motion Enhancement Adapter (SME-Adapter) to guide the proposed side network in efficiently learning the rich motion information in videos, thereby improving its ability to capture motion. Extensive experiments are conducted on three benchmark datasets: Something-Something V1 and V2, and Kinetics-400. Experimental results demonstrate that our approach achieves competitive performance.
