Video Question Answering On Situated
Metrics
Average Accuracy
Results
Performance results of various models on this benchmark
| Model Name | Average Accuracy | Paper Title |
| --- | --- | --- |
| MIST | 51.13 | MIST: Multi-modal Iterative Spatial-Temporal Transformer for Long-form Video Question Answering |
| TraveLER (0-shot) | 44.9 | TraveLER: A Modular Multi-LMM Agent Framework for Video Question-Answering |
| SHG-VQA (trained from scratch) | 39.47 | Learning Situation Hyper-Graphs for Video Question Answering |
| Flamingo-9B (4-shot) | 42.8 | Flamingo: a Visual Language Model for Few-Shot Learning |
| SeViLA | 64.9 | Self-Chained Image-Language Model for Video Localization and Question Answering |
| All-in-one | 47.5 | All in One: Exploring Unified Video-Language Pre-training |
| GF(sup) | 53.94 | Glance and Focus: Memory Prompting for Multi-Event Video Question Answering |
| VLAP (4 frames) | 67.1 | ViLA: Efficient Video-Language Alignment for Video Question Answering |
| SeViLA (0-shot) | 44.6 | Self-Chained Image-Language Model for Video Localization and Question Answering |
| Flamingo-80B (0-shot) | 39.7 | Flamingo: a Visual Language Model for Few-Shot Learning |
| LLaMA-VQA | 65.4 | Large Language Models are Temporal and Causal Reasoners for Video Question Answering |
| InternVideo | 58.7 | InternVideo: General Video Foundation Models via Generative and Discriminative Learning |
| Flamingo-9B (0-shot) | 41.8 | Flamingo: a Visual Language Model for Few-Shot Learning |
| Temp[ATP] | 48.37 | Revisiting the "Video" in Video-Language Understanding |
| AnyMAL-70B (0-shot) | 48.2 | AnyMAL: An Efficient and Scalable Any-Modality Augmented Language Model |
| Flamingo-80B (4-shot) | 42.4 | Flamingo: a Visual Language Model for Few-Shot Learning |
| GF(uns) | 53.86 | Glance and Focus: Memory Prompting for Multi-Event Video Question Answering |