HyperAI超神経

Visual Question Answering on MSVD-QA

Evaluation Metric

Accuracy
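MSVD-QA is an open-ended VideoQA benchmark with short (typically single-word) ground-truth answers, so accuracy is usually exact-match accuracy: the fraction of questions whose predicted answer string equals the reference answer. A minimal sketch of that computation follows; the function and variable names are illustrative, not taken from any official evaluation script, and the lowercasing/whitespace normalization is a common but unofficial convention.

```python
def accuracy(predictions, ground_truths):
    """Fraction of predictions that exactly match the reference answer.

    Both inputs are parallel lists of answer strings; comparison is
    case-insensitive and ignores surrounding whitespace (an assumption,
    not part of any official MSVD-QA protocol).
    """
    assert len(predictions) == len(ground_truths), "lists must align"
    correct = sum(
        p.strip().lower() == g.strip().lower()
        for p, g in zip(predictions, ground_truths)
    )
    return correct / len(predictions)

# Example: 2 of 3 answers match, so accuracy is 2/3.
preds = ["dog", "run", "two"]
golds = ["dog", "walk", "two"]
print(round(accuracy(preds, golds), 3))  # -> 0.667
```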

Evaluation Results

Performance of each model on this benchmark (sorted by accuracy; "-" in the Repository column means no repository is listed):

| Model | Accuracy | Paper Title | Repository |
|---|---|---|---|
| MaMMUT (ours) | 0.602 | MaMMUT: A Simple Architecture for Joint Learning for MultiModal Tasks | |
| VAST | 0.600 | VAST: A Vision-Audio-Subtitle-Text Omni-Modality Foundation Model and Dataset | |
| VideoCoCa | 0.569 | VideoCoCa: Video-Text Modeling with Zero-Shot Transfer from Contrastive Captioners | - |
| InternVideo | 0.555 | InternVideo: General Video Foundation Models via Generative and Discriminative Learning | |
| UMT-L (ViT-L/16) | 0.552 | Unmasked Teacher: Towards Training-Efficient Video Foundation Models | |
| MuLTI | 0.547 | MuLTI: Efficient Video-and-Language Understanding with Text-Guided MultiWay-Sampler and Multiple Choice Modeling | - |
| X2-VLM (large) | 0.546 | X$^2$-VLM: All-In-One Pre-trained Model For Vision-Language Tasks | |
| X2-VLM (base) | 0.528 | X$^2$-VLM: All-In-One Pre-trained Model For Vision-Language Tasks | |
| Clover | 0.524 | Clover: Towards A Unified Video-Language Alignment and Fusion Model | |
| VIOLET + MELTR | 0.517 | MELTR: Meta Loss Transformer for Learning to Fine-tune Video Foundation Models | |
| OmniVL | 0.510 | OmniVL: One Foundation Model for Image-Language and Video-Language Tasks | - |
| All-in-one-B | 0.483 | All in One: Exploring Unified Video-Language Pre-training | |
| LRCE | 0.478 | Lightweight Recurrent Cross-modal Encoder for Video Question Answering | |
| AIO+MIF | 0.467 | Self-Adaptive Sampling for Efficient Video Question-Answering on Image--Text Models | |
| All-in-one+ | 0.438 | Open-vocabulary Video Question Answering: A New Benchmark for Evaluating the Generalizability of Video Question Answering Models | |
| DualVGR | 0.390 | DualVGR: A Dual-Visual Graph Reasoning Unit for Video Question Answering | |
| HCRN | 0.361 | Hierarchical Conditional Relation Networks for Video Question Answering | |
| HMEMA | 0.337 | Heterogeneous Memory Enhanced Multimodal Attention Model for Video Question Answering | |
| Co-Mem | 0.317 | Motion-Appearance Co-Memory Networks for Video Question Answering | - |
| ST-VQA | 0.313 | TGIF-QA: Toward Spatio-Temporal Reasoning in Visual Question Answering | |