Visual Question Answering (VQA)
Visual Question Answering on MSRVTT-QA 1
Metrics
Accuracy
Results
Performance results of different models on this benchmark.
| Model name | Accuracy | Paper Title |
|---|---|---|
| VLAB | 0.496 | VLAB: Enhancing Video Language Pre-training by Feature Adapting and Blending |
| MaMMUT | 0.495 | MaMMUT: A Simple Architecture for Joint Learning for MultiModal Tasks |
| mPLUG-2 | 0.480 | mPLUG-2: A Modularized Multi-modal Foundation Model Across Text, Image and Video |
| MuLTI | 0.478 | MuLTI: Efficient Video-and-Language Understanding with Text-Guided MultiWay-Sampler and Multiple Choice Modeling |
| Flamingo | 0.474 | Flamingo: a Visual Language Model for Few-Shot Learning |
| UMT-L (ViT-L/16) | 0.471 | Unmasked Teacher: Towards Training-Efficient Video Foundation Models |
| InternVideo | 0.471 | InternVideo: General Video Foundation Models via Generative and Discriminative Learning |
| vid-TLDR (UMT-L) | 0.470 | vid-TLDR: Training Free Token merging for Light-weight Video Transformer |
| FrozenBiLM+ | 0.470 | Open-vocabulary Video Question Answering: A New Benchmark for Evaluating the Generalizability of Video Question Answering Models |
| VideoCoCa | 0.463 | VideoCoCa: Video-Text Modeling with Zero-Shot Transfer from Contrastive Captioners |
| HBI | 0.462 | Video-Text as Game Players: Hierarchical Banzhaf Interaction for Cross-Modal Representation Learning |
| HiTeA | 0.459 | HiTeA: Hierarchical Temporal-Aware Video-Language Pre-training |
| EMCL-Net | 0.458 | Expectation-Maximization Contrastive Learning for Compact Video-and-Language Representations |
| Co-Tokenization | 0.457 | Video Question Answering with Iterative Video-Text Co-Tokenization |
| X2-VLM (large) | 0.455 | X$^2$-VLM: All-In-One Pre-trained Model For Vision-Language Tasks |
| X2-VLM (base) | 0.450 | X$^2$-VLM: All-In-One Pre-trained Model For Vision-Language Tasks |
| All-in-one-B | 0.443 | All in One: Exploring Unified Video-Language Pre-training |
| Clover | 0.441 | Clover: Towards A Unified Video-Language Alignment and Fusion Model |
| OmniVL | 0.441 | OmniVL: One Foundation Model for Image-Language and Video-Language Tasks |
| AIO+MIF | 0.440 | Self-Adaptive Sampling for Efficient Video Question-Answering on Image--Text Models |
The table above lists the first 20 of the 34 entries tracked for this benchmark.
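The only metric reported here is accuracy. MSRVTT-QA is an open-ended video question answering benchmark, and accuracy is conventionally computed as the fraction of questions whose predicted answer matches the single ground-truth answer. The sketch below illustrates that computation; the record fields (`question_id`, `answer`) and the lower-casing normalization are illustrative assumptions, not the evaluation code of any particular model on this leaderboard.

```python
from typing import Dict, List


def normalize(answer: str) -> str:
    """Minimal, assumed normalization: strip whitespace and lower-case."""
    return answer.strip().lower()


def msrvtt_qa_accuracy(predictions: Dict[str, str], references: List[dict]) -> float:
    """Top-1 accuracy: share of questions whose prediction matches the ground truth.

    predictions: maps a question id to the model's predicted answer string.
    references:  list of records with (assumed) keys 'question_id' and 'answer'.
    """
    if not references:
        return 0.0
    correct = 0
    for ref in references:
        pred = predictions.get(ref["question_id"], "")
        if normalize(pred) == normalize(ref["answer"]):
            correct += 1
    return correct / len(references)


# Toy usage: two of three answers match, so accuracy is ~0.667.
refs = [
    {"question_id": "q1", "answer": "dog"},
    {"question_id": "q2", "answer": "kitchen"},
    {"question_id": "q3", "answer": "two"},
]
preds = {"q1": "dog", "q2": "kitchen", "q3": "three"}
print(f"accuracy = {msrvtt_qa_accuracy(preds, refs):.3f}")  # accuracy = 0.667
```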