6 months ago

Abstract

A video question answering task essentially boils down to how to fuse the information between text and video effectively to predict an answer. Most works employ a transformer encoder as a cross-modal encoder to fuse both modalities by leveraging the full self-attention mechanism. Due to the high computational cost of the self-attention and the high dimensional data of video, they either have to settle for: 1) only training the cross-modal encoder on offline-extracted video and text features or 2) training the cross-modal encoder with the video and text feature extractor, but only using sparsely-sampled video frames. Training only from offline-extracted features suffers from the disconnection between the extracted features and the data of the downstream task because the video and text feature extractors are trained independently on different domains, e.g., action recognition for the video feature extractor and semantic classification for the text feature extractor. Training using sparsely-sampled video frames might suffer from information loss if the video contains very rich information or has many frames. To alleviate those issues, we propose Lightweight Recurrent Cross-modal Encoder (LRCE) that replaces the self-attention operation with a single learnable special token to summarize the text and video features. As a result, our model incurs a significantly lower computational cost. Additionally, we perform a novel multi-segment sampling which sparsely samples the video frames from different segments of the video to provide more fine-grained information. Through extensive experiments on three VideoQA datasets, we demonstrate the LRCE achieves significant performance gains compared to previous works.

Source PDF View Code