AM Flow: Adapters for Temporal Processing in Action Recognition

Deep learning models, in particular \textit{image} models, have recently gained generalisability and robustness. In this work, we propose to exploit such advances in the realm of \textit{video} classification. Video foundation models suffer from the need for extensive pretraining and long training times. To mitigate these limitations, we propose ``\textit{Attention Map (AM) Flow}'' for image models, a method for identifying the pixels relevant to motion in each input video frame. In this context, we propose two methods to compute AM flow, depending on camera motion. AM flow allows the separation of spatial and temporal processing, while providing improved results over combined spatio-temporal processing (as in video models). Adapters, one of the popular techniques in parameter-efficient transfer learning, facilitate the incorporation of AM flow into pretrained image models, mitigating the need for full finetuning. We extend adapters to ``\textit{temporal processing adapters}'' by incorporating a temporal processing unit into the adapters. Our work achieves faster convergence, reducing the number of epochs needed for training. Moreover, we endow an image model with the ability to achieve state-of-the-art results on popular action recognition datasets, which reduces training time and simplifies pretraining. We present experiments on the Kinetics-400, Something-Something v2, and Toyota Smarthome datasets, showing state-of-the-art or comparable results.
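To make the two named ideas concrete, below is a minimal PyTorch sketch under assumptions the abstract does not spell out: AM flow is taken here as the frame-to-frame difference of per-frame attention maps (a plausible reading of the simpler, static-camera variant), and the ``temporal processing adapter'' is taken as a standard bottleneck adapter with a temporal self-attention unit between its projections. All class, function, and argument names (\texttt{am\_flow}, \texttt{TemporalProcessingAdapter}, \texttt{bottleneck}) are hypothetical, not the authors' implementation.

\begin{verbatim}
# Hedged sketch, not the paper's code: hypothetical AM flow (static-camera
# case assumed to be an attention-map difference) and a bottleneck adapter
# with a temporal processing unit, applied to frozen image-model tokens.
import torch
import torch.nn as nn


def am_flow(attn_t: torch.Tensor, attn_prev: torch.Tensor) -> torch.Tensor:
    """Assumed static-camera AM flow: difference of attention maps of
    consecutive frames, highlighting tokens whose attention changed,
    i.e., pixels relevant to motion."""
    return attn_t - attn_prev


class TemporalProcessingAdapter(nn.Module):
    """Hypothetical layout: bottleneck adapter + temporal unit.

    Expects token features of shape (B, T, N, D): batch, frames, tokens
    per frame, embedding dim. Spatial processing stays in the frozen
    image backbone; this adapter only mixes information across T.
    """

    def __init__(self, dim: int, bottleneck: int = 64, heads: int = 4):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)   # down-projection
        self.act = nn.GELU()
        self.temporal = nn.MultiheadAttention(   # temporal processing unit
            bottleneck, heads, batch_first=True
        )
        self.up = nn.Linear(bottleneck, dim)     # up-projection
        nn.init.zeros_(self.up.weight)           # near-identity at init,
        nn.init.zeros_(self.up.bias)             # standard adapter practice

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, t, n, d = x.shape
        h = self.act(self.down(x))               # (B, T, N, bottleneck)
        # Fold tokens into the batch so each spatial location attends
        # over its own temporal sequence of length T.
        h = h.permute(0, 2, 1, 3).reshape(b * n, t, -1)
        h, _ = self.temporal(h, h, h)
        h = h.reshape(b, n, t, -1).permute(0, 2, 1, 3)
        return x + self.up(h)                    # residual, as in adapters


# Usage: plug after a frozen ViT block's output tokens.
tokens = torch.randn(2, 8, 197, 768)             # B=2, T=8, ViT-B token dims
adapter = TemporalProcessingAdapter(dim=768)
print(adapter(tokens).shape)                     # torch.Size([2, 8, 197, 768])
\end{verbatim}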