Visual Question Answering (VQA)
Visual Question Answering on MSVD-QA
Metrics: Accuracy

Results
Performance results of various models on this benchmark (top 20 of 36 entries shown).
| Model Name | Accuracy | Paper Title |
| --- | --- | --- |
| VLAB | 0.61 | VLAB: Enhancing Video Language Pre-training by Feature Adapting and Blending |
| MA-LMM | 0.606 | MA-LMM: Memory-Augmented Large Multimodal Model for Long-Term Video Understanding |
| MaMMUT (ours) | 0.602 | MaMMUT: A Simple Architecture for Joint Learning for MultiModal Tasks |
| VAST | 0.60 | VAST: A Vision-Audio-Subtitle-Text Omni-Modality Foundation Model and Dataset |
| COSA | 0.60 | COSA: Concatenated Sample Pretrained Vision-Language Foundation Model |
| VALOR | 0.60 | VALOR: Vision-Audio-Language Omni-Perception Pretraining Model and Dataset |
| mPLUG-2 | 0.581 | mPLUG-2: A Modularized Multi-modal Foundation Model Across Text, Image and Video |
| VideoCoCa | 0.569 | VideoCoCa: Video-Text Modeling with Zero-Shot Transfer from Contrastive Captioners |
| GIT | 0.568 | GIT: A Generative Image-to-text Transformer for Vision and Language |
| FrozenBiLM+ | 0.558 | Open-vocabulary Video Question Answering: A New Benchmark for Evaluating the Generalizability of Video Question Answering Models |
| HiTeA | 0.556 | HiTeA: Hierarchical Temporal-Aware Video-Language Pre-training |
| InternVideo | 0.555 | InternVideo: General Video Foundation Models via Generative and Discriminative Learning |
| UMT-L (ViT-L/16) | 0.552 | Unmasked Teacher: Towards Training-Efficient Video Foundation Models |
| vid-TLDR (UMT-L) | 0.549 | vid-TLDR: Training Free Token merging for Light-weight Video Transformer |
| MuLTI | 0.547 | MuLTI: Efficient Video-and-Language Understanding with Text-Guided MultiWay-Sampler and Multiple Choice Modeling |
| VIOLETv2 | 0.547 | An Empirical Study of End-to-End Video-Language Transformers with Masked Visual Modeling |
| X2-VLM (large) | 0.546 | X$^2$-VLM: All-In-One Pre-trained Model For Vision-Language Tasks |
| X2-VLM (base) | 0.528 | X$^2$-VLM: All-In-One Pre-trained Model For Vision-Language Tasks |
| Clover | 0.524 | Clover: Towards A Unified Video-Language Alignment and Fusion Model |
| VIOLET + MELTR | 0.517 | MELTR: Meta Loss Transformer for Learning to Fine-tune Video Foundation Models |
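The Accuracy metric on open-ended MSVD-QA is commonly reported as exact-match accuracy: a prediction counts as correct only if it equals the ground-truth answer after simple normalization. The sketch below illustrates that computation; the normalization rules and the variable names (`predictions`, `answers`) are assumptions for illustration, not the evaluation code used by any of the listed papers.

```python
# Minimal sketch of exact-match accuracy for open-ended video QA.
# Normalization and names are illustrative assumptions, not an official script.

def normalize(ans: str) -> str:
    """Lowercase and strip surrounding whitespace and trailing periods."""
    return ans.strip().strip(".").lower()

def exact_match_accuracy(predictions: list[str], answers: list[str]) -> float:
    """Fraction of questions whose predicted answer matches the ground truth."""
    assert len(predictions) == len(answers)
    if not answers:
        return 0.0
    correct = sum(normalize(p) == normalize(a) for p, a in zip(predictions, answers))
    return correct / len(answers)

if __name__ == "__main__":
    preds = ["dog", "Running", "two"]
    golds = ["dog", "running", "three"]
    print(f"Accuracy: {exact_match_accuracy(preds, golds):.3f}")  # 0.667
```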