Dual-path Adaptation from Image to Video Transformers

In this paper, we efficiently transfer the superior representation power of vision foundation models, such as ViT and Swin, to video understanding with only a few trainable parameters. Previous adaptation methods have considered spatial and temporal modeling simultaneously with a unified learnable module, but still fall short of fully leveraging the representative capabilities of image transformers. We argue that the popular dual-path (two-stream) architecture in video models can mitigate this problem. We propose a novel DualPath adaptation separated into spatial and temporal adaptation paths, where a lightweight bottleneck adapter is employed in each transformer block. Especially for temporal dynamic modeling, we incorporate consecutive frames into a grid-like frameset to precisely imitate vision transformers' capability of extrapolating relationships between tokens. In addition, we extensively investigate multiple baselines from a unified perspective on video understanding and compare them with DualPath. Experimental results on four action recognition benchmarks prove that pretrained image transformers with DualPath can be effectively generalized beyond the data domain.
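
To make the two ingredients named in the abstract concrete, the sketch below shows, under stated assumptions, what a lightweight bottleneck adapter and a grid-like frameset could look like in PyTorch. This is not the authors' implementation: the names `BottleneckAdapter` and `make_grid_frameset`, the bottleneck width, and the 2x2 grid size are all illustrative assumptions.

```python
# Minimal sketch (PyTorch), not the paper's code: all module/function names
# and hyperparameters here are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class BottleneckAdapter(nn.Module):
    """Lightweight bottleneck adapter: down-project, non-linearity, up-project,
    added residually to the frozen transformer block's features."""

    def __init__(self, dim: int, bottleneck_dim: int = 64):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck_dim)
        self.up = nn.Linear(bottleneck_dim, dim)
        # Zero-init the up-projection so the adapter starts as an identity
        # and the pretrained image representation is preserved at step 0.
        nn.init.zeros_(self.up.weight)
        nn.init.zeros_(self.up.bias)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.up(F.gelu(self.down(x)))


def make_grid_frameset(video: torch.Tensor, grid: int = 2) -> torch.Tensor:
    """Tile consecutive frames into a grid-like frameset so a 2D image
    transformer can relate tokens across time within one composite image.

    video: (B, T, C, H, W) with T == grid * grid  ->  (B, C, grid*H, grid*W)
    """
    b, t, c, h, w = video.shape
    assert t == grid * grid, "number of frames must fill the grid"
    video = video.view(b, grid, grid, c, h, w)
    video = video.permute(0, 3, 1, 4, 2, 5).reshape(b, c, grid * h, grid * w)
    return video


if __name__ == "__main__":
    frames = torch.randn(2, 4, 3, 224, 224)          # 4 consecutive frames
    grid_image = make_grid_frameset(frames, grid=2)  # (2, 3, 448, 448)
    adapter = BottleneckAdapter(dim=768)             # e.g. ViT-B token width
    tokens = torch.randn(2, 197, 768)
    print(grid_image.shape, adapter(tokens).shape)   # adapter preserves shape
```

In a parameter-efficient setup of this kind, only the adapter weights would be trained while the image transformer stays frozen; the grid-like frameset lets the frozen spatial attention see several timesteps at once, which is the intuition behind the temporal path described above.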