Lightweight Recurrent Cross-modal Encoder for Video Question Answering

Cheol Jeong, Steve Andreas Immanuel

Abstract

Video question answering (VideoQA) essentially comes down to fusing text and video information effectively to predict an answer. Most works use a transformer encoder as the cross-modal encoder, fusing both modalities through the full self-attention mechanism. Because self-attention is computationally expensive and video data is high-dimensional, these works must settle for one of two options: 1) training only the cross-modal encoder on offline-extracted video and text features, or 2) training the cross-modal encoder together with the video and text feature extractors, but using only sparsely sampled video frames. Training only on offline-extracted features suffers from a disconnect between the extracted features and the data of the downstream task, because the video and text feature extractors are trained independently on different domains, e.g., action recognition for the video feature extractor and semantic classification for the text feature extractor. Training on sparsely sampled video frames can lose information when the video is information-rich or has many frames. To alleviate these issues, we propose the Lightweight Recurrent Cross-modal Encoder (LRCE), which replaces the self-attention operation with a single learnable special token that summarizes the text and video features. As a result, our model incurs a significantly lower computational cost. Additionally, we perform a novel multi-segment sampling, which sparsely samples video frames from different segments of the video to provide more fine-grained information. Through extensive experiments on three VideoQA datasets, we demonstrate that LRCE achieves significant performance gains compared to previous works.
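The multi-segment sampling idea can be illustrated with a minimal sketch. The abstract does not specify how segments are formed or how frames are chosen within each one, so everything below is an assumption made for illustration: the function name `multi_segment_sample`, its parameters, the equal-length contiguous segments, and the evenly spaced picks inside each segment are all hypothetical and are not taken from the paper.

```python
def multi_segment_sample(num_frames, num_segments, frames_per_segment):
    """Return frame indices grouped by segment.

    Assumption (not from the paper): the video is split into
    `num_segments` contiguous, roughly equal segments, and
    `frames_per_segment` evenly spaced frames are taken from each.
    """
    if num_frames <= 0 or num_segments <= 0 or frames_per_segment <= 0:
        raise ValueError("all arguments must be positive integers")

    segments = []
    for s in range(num_segments):
        # Segment boundaries: [start, end) in frame indices.
        start = s * num_frames // num_segments
        end = (s + 1) * num_frames // num_segments
        # Guard against empty segments when the video is very short.
        length = max(end - start, 1)

        # Pick the midpoint of each of `frames_per_segment` equal slices.
        indices = [
            min(start + (2 * k + 1) * length // (2 * frames_per_segment),
                num_frames - 1)
            for k in range(frames_per_segment)
        ]
        segments.append(indices)
    return segments


if __name__ == "__main__":
    # A 120-frame video, 4 segments, 3 sparsely sampled frames per segment.
    for i, idx in enumerate(multi_segment_sample(120, 4, 3)):
        print(f"segment {i}: frames {idx}")
```

Under these assumptions, a 120-frame video with 4 segments and 3 frames per segment yields the indices [5, 15, 25], [35, 45, 55], [65, 75, 85], and [95, 105, 115]. Each segment is still sampled sparsely, but the segments together cover the whole video rather than a single window.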

