Video Question Answering on TVBench
Metrics
Average Accuracy

Results
Performance of the different models on this benchmark.
| Model Name | Average Accuracy | Paper Title |
|---|---|---|
| Tarsier-34B | 55.5 | Tarsier: Recipes for Training and Evaluating Large Video Description Models |
| Tarsier2-7B | 54.7 | Tarsier2: Advancing Large Vision-Language Models from Detailed Video Description to Comprehensive Video Understanding |
| Qwen2-VL-72B | 52.7 | Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution |
| IXC-2.5 7B | 51.6 | InternLM-XComposer-2.5: A Versatile Large Vision Language Model Supporting Long-Contextual Input and Output |
| Aria | 51.0 | Aria: An Open Multimodal Native Mixture-of-Experts Model |
| LLaVA-Video 72B | 50.0 | Video Instruction Tuning With Synthetic Data |
| VideoLLaMA2 72B | 48.4 | VideoLLaMA 2: Advancing Spatial-Temporal Modeling and Audio Understanding in Video-LLMs |
| Gemini 1.5 Pro | 47.6 | Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context |
| Tarsier-7B | 46.9 | Tarsier: Recipes for Training and Evaluating Large Video Description Models |
| LLaVA-Video 7B | 45.6 | Video Instruction Tuning With Synthetic Data |
| Qwen2-VL-7B | 43.8 | Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution |
| VideoLLaMA2 7B | 42.9 | VideoLLaMA 2: Advancing Spatial-Temporal Modeling and Audio Understanding in Video-LLMs |
| PLLaVA-34B | 42.3 | PLLaVA: Parameter-free LLaVA Extension from Images to Videos for Video Dense Captioning |
| mPLUG-Owl3 | 42.2 | mPLUG-Owl3: Towards Long Image-Sequence Understanding in Multi-Modal Large Language Models |
| VideoLLaMA2.1 | 42.1 | VideoLLaMA 2: Advancing Spatial-Temporal Modeling and Audio Understanding in Video-LLMs |
| VideoGPT+ | 41.7 | VideoGPT+: Integrating Image and Video Encoders for Enhanced Video Understanding |
| GPT-4o (8 frames) | 39.9 | GPT-4o System Card |
| PLLaVA-13B | 36.4 | PLLaVA: Parameter-free LLaVA Extension from Images to Videos for Video Dense Captioning |
| ST-LLM | 35.7 | ST-LLM: Large Language Models Are Effective Temporal Learners |
| VideoChat2 | 35.0 | MVBench: A Comprehensive Multi-modal Video Understanding Benchmark |
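
Average Accuracy here is reported as the percentage of multiple-choice questions a model answers correctly. The sketch below shows that computation in its usual form; the helper name and the toy prediction/reference lists are hypothetical illustrations and are not taken from the TVBench release or this leaderboard.

```python
# Minimal sketch: how an "Average Accuracy" score like those above is usually
# computed for a multiple-choice video QA benchmark. The helper name and the
# toy data are hypothetical; they are not from TVBench or HyperAI.

def average_accuracy(predictions, references):
    """Return the percentage of questions whose predicted option letter
    matches the reference option letter."""
    assert len(predictions) == len(references), "one prediction per question"
    correct = sum(p == r for p, r in zip(predictions, references))
    return 100.0 * correct / len(references)

if __name__ == "__main__":
    preds = ["A", "C", "B", "D"]   # model's chosen options
    golds = ["A", "C", "B", "A"]   # ground-truth options
    print(f"Average Accuracy: {average_accuracy(preds, golds):.1f}")  # 75.0
```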