Visual Question Answering on MSVD-QA
Metrics
Accuracy (top-1 answer accuracy; a computation sketch follows the results table)

Results
Performance of different models on this benchmark (first 20 of 36 entries shown).
Model Name | Accuracy | Paper Title
VLAB | 0.61 | VLAB: Enhancing Video Language Pre-training by Feature Adapting and Blending
MA-LMM | 0.606 | MA-LMM: Memory-Augmented Large Multimodal Model for Long-Term Video Understanding
MaMMUT (ours) | 0.602 | MaMMUT: A Simple Architecture for Joint Learning for MultiModal Tasks
VAST | 0.60 | VAST: A Vision-Audio-Subtitle-Text Omni-Modality Foundation Model and Dataset
COSA | 0.60 | COSA: Concatenated Sample Pretrained Vision-Language Foundation Model
VALOR | 0.60 | VALOR: Vision-Audio-Language Omni-Perception Pretraining Model and Dataset
mPLUG-2 | 0.581 | mPLUG-2: A Modularized Multi-modal Foundation Model Across Text, Image and Video
VideoCoCa | 0.569 | VideoCoCa: Video-Text Modeling with Zero-Shot Transfer from Contrastive Captioners
GIT | 0.568 | GIT: A Generative Image-to-text Transformer for Vision and Language
FrozenBiLM+ | 0.558 | Open-vocabulary Video Question Answering: A New Benchmark for Evaluating the Generalizability of Video Question Answering Models
HiTeA | 0.556 | HiTeA: Hierarchical Temporal-Aware Video-Language Pre-training
InternVideo | 0.555 | InternVideo: General Video Foundation Models via Generative and Discriminative Learning
UMT-L (ViT-L/16) | 0.552 | Unmasked Teacher: Towards Training-Efficient Video Foundation Models
vid-TLDR (UMT-L) | 0.549 | vid-TLDR: Training Free Token merging for Light-weight Video Transformer
MuLTI | 0.547 | MuLTI: Efficient Video-and-Language Understanding with Text-Guided MultiWay-Sampler and Multiple Choice Modeling
VIOLETv2 | 0.547 | An Empirical Study of End-to-End Video-Language Transformers with Masked Visual Modeling
X2-VLM (large) | 0.546 | X$^2$-VLM: All-In-One Pre-trained Model For Vision-Language Tasks
X2-VLM (base) | 0.528 | X$^2$-VLM: All-In-One Pre-trained Model For Vision-Language Tasks
Clover | 0.524 | Clover: Towards A Unified Video-Language Alignment and Fusion Model
VIOLET + MELTR | 0.517 | MELTR: Meta Loss Transformer for Learning to Fine-tune Video Foundation Models
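On MSVD-QA, Accuracy is conventionally the fraction of questions for which the model's top-1 open-ended answer matches the ground-truth answer. Below is a minimal sketch of that computation, assuming predictions and references are plain answer strings; the normalize helper and its exact-match rule are illustrative assumptions, since the exact normalization details vary from paper to paper.

```python
# Sketch of top-1 exact-match answer accuracy for MSVD-QA-style open-ended
# video QA. The normalize() rule below is an assumption for illustration;
# individual papers differ in how they canonicalize answers.

def normalize(answer: str) -> str:
    """Lowercase and strip surrounding whitespace and trailing periods (assumed rule)."""
    return answer.strip().lower().rstrip(".")

def answer_accuracy(predictions: list[str], references: list[str]) -> float:
    """Fraction of questions whose normalized prediction equals the reference."""
    assert len(predictions) == len(references), "one prediction per question"
    correct = sum(normalize(p) == normalize(r)
                  for p, r in zip(predictions, references))
    return correct / len(references)

# Example: 2 of 3 answers match after normalization -> accuracy 0.667
preds = ["cat", "two", "Guitar"]
refs = ["cat", "three", "guitar"]
print(f"accuracy = {answer_accuracy(preds, refs):.3f}")
```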