Revealing Single Frame Bias for Video-and-Language Learning

Training an effective video-and-language model intuitively requires multiple frames as model inputs. However, it is unclear whether using multiple frames is beneficial to downstream tasks, and if so, whether the performance gain is worth the drastically increased computation and memory costs resulting from using more frames. In this work, we explore single-frame models for video-and-language learning. On a diverse set of video-and-language tasks (including text-to-video retrieval and video question answering), we show the surprising result that, with large-scale pre-training and a proper frame ensemble strategy at inference time, a single-frame trained model that does not consider temporal information can achieve better performance than existing methods that use multiple frames for training. This result reveals the existence of a strong "static appearance bias" in popular video-and-language datasets. Therefore, to allow for a more comprehensive evaluation of video-and-language models, we propose two new retrieval tasks based on existing fine-grained action recognition datasets that encourage temporal modeling. Our code is available at https://github.com/jayleicn/singularity
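To make the idea of a frame ensemble at inference time concrete, below is a minimal sketch of one common scheme: score each frame independently with a single-frame model and aggregate the per-frame scores into a video-level score. The `model`, `video_frames`, and `text` names here are illustrative assumptions, and the reduction shown (mean or max over frames) is just one possible ensemble choice; it is not necessarily the exact strategy used in the paper or its released code.

```python
import torch


@torch.no_grad()
def frame_ensemble_score(model, video_frames, text, reduce="mean"):
    """Score a (video, text) pair with a single-frame model by ensembling
    per-frame scores at inference time.

    Args:
        model: a hypothetical single-frame scorer, called as model(frame, text)
            and returning a scalar similarity score per call.
        video_frames: tensor of shape (num_frames, C, H, W), e.g. uniformly
            sampled frames from the video.
        text: a tokenized text query accepted by `model`.
        reduce: how to aggregate per-frame scores ("mean" or "max").
    """
    # Score each frame independently; the model itself never sees more
    # than one frame at a time, so no temporal modeling is involved.
    per_frame_scores = torch.stack(
        [model(frame.unsqueeze(0), text) for frame in video_frames]
    )

    # Aggregate per-frame scores into a single video-level score.
    if reduce == "mean":
        return per_frame_scores.mean(dim=0)
    elif reduce == "max":
        return per_frame_scores.max(dim=0).values
    else:
        raise ValueError(f"unknown reduction: {reduce}")
```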