Visual Question Answering on MSVD-QA
Evaluation metric: Accuracy

Evaluation results: the performance of each model on this benchmark is listed below.
| Model Name | Accuracy | Paper Title | Repository |
|---|---|---|---|
| MuLTI | 0.547 | MuLTI: Efficient Video-and-Language Understanding with Text-Guided MultiWay-Sampler and Multiple Choice Modeling | - |
| OmniVL | 0.510 | OmniVL: One Foundation Model for Image-Language and Video-Language Tasks | - |
| HMEMA | 0.337 | Heterogeneous Memory Enhanced Multimodal Attention Model for Video Question Answering | |
| DualVGR | 0.390 | DualVGR: A Dual-Visual Graph Reasoning Unit for Video Question Answering | |
| VAST | 0.600 | VAST: A Vision-Audio-Subtitle-Text Omni-Modality Foundation Model and Dataset | |
| HCRN | 0.361 | Hierarchical Conditional Relation Networks for Video Question Answering | |
| Clover | 0.524 | Clover: Towards A Unified Video-Language Alignment and Fusion Model | |
| All-in-one-B | 0.483 | All in One: Exploring Unified Video-Language Pre-training | |
| AIO+MIF | 0.467 | Self-Adaptive Sampling for Efficient Video Question-Answering on Image--Text Models | |
| InternVideo | 0.555 | InternVideo: General Video Foundation Models via Generative and Discriminative Learning | |
| VideoCoCa | 0.569 | VideoCoCa: Video-Text Modeling with Zero-Shot Transfer from Contrastive Captioners | - |
| All-in-one+ | 0.438 | Open-vocabulary Video Question Answering: A New Benchmark for Evaluating the Generalizability of Video Question Answering Models | |
| X2-VLM (base) | 0.528 | X$^2$-VLM: All-In-One Pre-trained Model For Vision-Language Tasks | |
| ST-VQA | 0.313 | TGIF-QA: Toward Spatio-Temporal Reasoning in Visual Question Answering | |
| MaMMUT | 0.602 | MaMMUT: A Simple Architecture for Joint Learning for MultiModal Tasks | |
| UMT-L (ViT-L/16) | 0.552 | Unmasked Teacher: Towards Training-Efficient Video Foundation Models | |
| LRCE | 0.478 | Lightweight Recurrent Cross-modal Encoder for Video Question Answering | |
| Co-Mem | 0.317 | Motion-Appearance Co-Memory Networks for Video Question Answering | - |
| X2-VLM (large) | 0.546 | X$^2$-VLM: All-In-One Pre-trained Model For Vision-Language Tasks | |
| VIOLET + MELTR | 0.517 | MELTR: Meta Loss Transformer for Learning to Fine-tune Video Foundation Models | |
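
MSVD-QA is an open-ended video question answering benchmark, and the Accuracy column above is the fraction of questions answered exactly right. Below is a minimal sketch of that top-1 exact-match computation; the scoring convention (single ground-truth answer per question, case-insensitive string match) and all identifiers such as `msvd_qa_accuracy` are illustrative assumptions, not the official evaluation code of any listed model.

```python
# Minimal sketch of top-1 accuracy for MSVD-QA-style open-ended QA.
# Assumption: each question id maps to one ground-truth answer string,
# and a prediction is correct only on an exact (case-insensitive) match.

def msvd_qa_accuracy(predictions: dict[str, str], ground_truth: dict[str, str]) -> float:
    """Fraction of questions whose predicted answer exactly matches the annotation."""
    if not ground_truth:
        return 0.0
    correct = sum(
        1
        for qid, answer in ground_truth.items()
        if predictions.get(qid, "").strip().lower() == answer.strip().lower()
    )
    return correct / len(ground_truth)


if __name__ == "__main__":
    gt = {"q1": "dog", "q2": "running", "q3": "two"}
    pred = {"q1": "dog", "q2": "walking", "q3": "two"}
    print(f"Accuracy: {msvd_qa_accuracy(pred, gt):.3f}")  # -> Accuracy: 0.667
```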