MMTF: Multi-Modal Temporal Fusion for Commonsense Video Question Answering

Video question answering is a challenging task that requires jointly understanding the video and the question in a shared context. It becomes even harder when questions involve reasoning, such as predicting future events or explaining counterfactual ones, because answering them requires knowledge that is not explicitly shown in the video. Existing methods rely on coarse-grained fusion of video and language features and largely ignore temporal information. To address this, we propose a novel vision-text fusion module that learns the temporal context of the video and the question. Our module expands question tokens along the video's temporal axis and fuses them with video features to generate new representations that capture both local and global context. We evaluate our method on four VideoQA datasets: MSVD-QA, NExT-QA, Causal-VidQA, and AGQA-2.0.
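To make the fusion step concrete, the following is a minimal sketch of the idea described above: question tokens are expanded along the video's temporal axis, fused with per-frame video features (local context), and then aggregated over the full sequence (global context). All module and parameter names (`TemporalFusion`, `dim`, `num_heads`) and the specific fusion operations (mean-pooled question expansion, concatenation plus self-attention) are illustrative assumptions, not the paper's exact design.

```python
import torch
import torch.nn as nn

class TemporalFusion(nn.Module):
    """Illustrative sketch: expand question tokens along the temporal axis
    and fuse them with video features for local and global context."""
    def __init__(self, dim: int = 512, num_heads: int = 8):
        super().__init__()
        self.local_fuse = nn.Linear(2 * dim, dim)      # per-frame (local) fusion
        self.global_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, video: torch.Tensor, question: torch.Tensor) -> torch.Tensor:
        # video:    (B, T, dim)  per-frame video features
        # question: (B, L, dim)  question token features
        B, T, D = video.shape
        # Pool question tokens and expand along the temporal axis: (B, T, D)
        q_expanded = question.mean(dim=1, keepdim=True).expand(-1, T, -1)
        # Local context: fuse each frame with the expanded question representation
        local = self.local_fuse(torch.cat([video, q_expanded], dim=-1))
        # Global context: let fused frames attend over the whole temporal sequence
        global_ctx, _ = self.global_attn(local, local, local)
        return self.norm(local + global_ctx)           # (B, T, dim)

# Usage with random tensors (2 videos, 16 frames, 20 question tokens)
fusion = TemporalFusion(dim=512)
out = fusion(torch.randn(2, 16, 512), torch.randn(2, 20, 512))
print(out.shape)  # torch.Size([2, 16, 512])
```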