Video Question Answering on ActivityNet-QA
Metrics
Accuracy

Results
Performance results of various models on this benchmark are listed below. A minimal sketch of how the accuracy metric is typically computed follows the table.
| Model Name | Accuracy (%) | Paper Title | Repository |
| --- | --- | --- | --- |
| LocVLM-Vid-B+ | 38.2 | Learning to Localize Objects Improves Spatial Reasoning in Visual-LLMs | - |
| E-MN | 27.1 | ActivityNet-QA: A Dataset for Understanding Complex Web Videos via Question Answering | - |
| VindLU | 44.7 | VindLU: A Recipe for Effective Video-and-Language Pretraining | - |
| Video-LLaVA | 45.3 | Video-LLaVA: Learning United Visual Representation by Alignment Before Projection | - |
| E-VQA | 25.1 | ActivityNet-QA: A Dataset for Understanding Complex Web Videos via Question Answering | - |
| VALOR | 48.6 | VALOR: Vision-Audio-Language Omni-Perception Pretraining Model and Dataset | - |
| E-SA | 31.8 | ActivityNet-QA: A Dataset for Understanding Complex Web Videos via Question Answering | - |
| BT-Adapter (zero-shot) | 46.1 | BT-Adapter: Video Conversation is Feasible Without Video Instruction Tuning | - |
| Mirasol3B | 51.13 | Mirasol3B: A Multimodal Autoregressive model for time-aligned and contextual modalities | - |
| Chat-UniVi-13B | 46.4 | Chat-UniVi: Unified Visual Representation Empowers Large Language Models with Image and Video Understanding | - |
| MA-LMM | 49.8 | MA-LMM: Memory-Augmented Large Multimodal Model for Long-Term Video Understanding | - |
| FrozenBiLM+ | 44.8 | Open-vocabulary Video Question Answering: A New Benchmark for Evaluating the Generalizability of Video Question Answering Models | - |
| MovieChat | 45.7 | MovieChat: From Dense Token to Sparse Memory for Long Video Understanding | - |
| Video-ChatGPT | 35.2 | Video-ChatGPT: Towards Detailed Video Understanding via Large Vision and Language Models | - |
| LLaMA-VID-7B (2 Token) | 47.4 | LLaMA-VID: An Image is Worth 2 Tokens in Large Language Models | - |
| VAST | 50.4 | VAST: A Vision-Audio-Subtitle-Text Omni-Modality Foundation Model and Dataset | - |
| TESTA (ViT-B/16) | 45 | TESTA: Temporal-Spatial Token Aggregation for Long-form Video-Language Understanding | - |
| GPT-2 + CLIP-14 + CLIP-multilingual (Zero-Shot) | 61.2 | Composing Ensembles of Pre-trained Models via Iterative Consensus | - |
| LocVLM-Vid-B | 37.4 | Learning to Localize Objects Improves Spatial Reasoning in Visual-LLMs | - |
| VideoCoCa | 56.1 | VideoCoCa: Video-Text Modeling with Zero-Shot Transfer from Contrastive Captioners | - |
(20 of 36 leaderboard entries are shown above.)
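The exact evaluation protocol varies by paper: ActivityNet-QA answers are open-ended, and while the original benchmark scores exact matches against the ground-truth answer, several LLM-based entries above instead report GPT-assisted correctness judgments. The following is a minimal sketch of the simpler exact-match variant, under assumed inputs; the file names and the `question_id`/`answer` record fields are hypothetical, and this is not the official evaluation script.

```python
# Minimal sketch: exact-match accuracy over ActivityNet-QA style predictions.
# ASSUMPTION: predictions and ground truth are JSON lists of records shaped
# like {"question_id": ..., "answer": ...}; this mirrors the benchmark's
# open-ended QA setup but is not the official scoring code.
import json

def normalize(answer: str) -> str:
    """Lowercase and strip surrounding whitespace and a trailing period,
    so trivial formatting differences do not count as errors."""
    return answer.strip().lower().rstrip(".")

def accuracy(pred_path: str, gt_path: str) -> float:
    """Return exact-match accuracy as a percentage over all ground-truth QA pairs."""
    with open(pred_path) as f:
        preds = {r["question_id"]: normalize(r["answer"]) for r in json.load(f)}
    with open(gt_path) as f:
        gts = {r["question_id"]: normalize(r["answer"]) for r in json.load(f)}
    # A missing prediction for a question counts as incorrect.
    correct = sum(1 for qid, ans in gts.items() if preds.get(qid) == ans)
    return 100.0 * correct / len(gts)

if __name__ == "__main__":
    # Hypothetical file names, for illustration only.
    print(f"Accuracy: {accuracy('predictions.json', 'ground_truth.json'):.1f}%")
```

Scores produced this way are directly comparable to the percentage values in the Accuracy column, but only for models evaluated with exact-match scoring; GPT-assisted numbers come from a different, more permissive judge.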