Zero-shot Video Question Answering on ActivityNet
Metrics: Accuracy and Confidence Score.
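On zero-shot VideoQA leaderboards of this kind, both numbers typically come from a GPT-assisted judging protocol (popularized by Video-ChatGPT): a judge model marks each predicted answer correct or incorrect and assigns a quality score from 0 to 5. Accuracy is the percentage of answers judged correct, and Confidence Score is the mean quality score. The sketch below shows this aggregation under those assumptions; the judgment format and helper name are illustrative, not HyperAI's actual pipeline.

```python
from statistics import mean

def summarize_judgments(judgments):
    """Aggregate per-question judge outputs into the two leaderboard metrics.

    Assumes each judgment is a dict {"correct": bool, "score": float},
    where "score" is the 0-5 quality rating from the judge model.
    (Hypothetical format -- adapt to the actual evaluation output.)
    """
    # Accuracy: percentage of answers the judge marked as correct.
    accuracy = 100.0 * mean(1.0 if j["correct"] else 0.0 for j in judgments)
    # Confidence Score: mean 0-5 quality rating across all questions.
    avg_score = mean(j["score"] for j in judgments)
    return accuracy, avg_score

# Hypothetical judged answers for three questions:
acc, score = summarize_judgments([
    {"correct": True,  "score": 4.0},
    {"correct": True,  "score": 3.5},
    {"correct": False, "score": 2.0},
])
print(f"Accuracy: {acc:.1f}%, Confidence Score: {score:.2f}")
```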
Results

Performance of the various models on this benchmark:

| Model | Accuracy | Confidence Score | Paper Title |
|-------|----------|------------------|-------------|
| Tarsier (34B) | 61.6 | 3.7 | Tarsier: Recipes for Training and Evaluating Large Video Description Models |
| PLLaVA (34B) | 60.9 | 3.7 | PLLaVA: Parameter-free LLaVA Extension from Images to Videos for Video Dense Captioning |
| PPLLaVA-7B | 60.7 | 3.6 | PPLLaVA: Varied Video Sequence Understanding With Prompt Guidance |
| LinVT-Qwen2-VL(7B) | 60.1 | 3.6 | LinVT: Empower Your Image-level Large Language Model to Understand Videos |
| SlowFast-LLaVA-34B | 59.2 | 3.5 | SlowFast-LLaVA: A Strong Training-Free Baseline for Video Large Language Models |
| TS-LLaVA-34B | 58.9 | 3.5 | TS-LLaVA: Constructing Visual Tokens through Thumbnail-and-Sampling for Training-Free Video Large Language Models |
| IG-VLM | 58.4 | 3.5 | An Image Grid Can Be Worth a Video: Zero-shot Video Question Answering Using a VLM |
| LLaVA-Mini | 53.5 | 3.5 | LLaVA-Mini: Efficient Image and Video Large Multimodal Models with One Vision Token |
| Flash-VStream | 51.9 | 3.4 | Flash-VStream: Memory-Based Real-Time Understanding for Long Video Streams |
| ST-LLM | 50.9 | 3.3 | ST-LLM: Large Language Models Are Effective Temporal Learners |
| VideoGPT+ | 50.6 | 3.6 | VideoGPT+: Integrating Image and Video Encoders for Enhanced Video Understanding |
| CAT-7B | 50.2 | 3.5 | CAT: Enhancing Multimodal Large Language Model to Answer Questions in Dynamic Audio-Visual Scenarios |
| Video-LaVIT | 50.1 | 3.3 | Video-LaVIT: Unified Video-Language Pre-training with Decoupled Visual-Motional Tokenization |
| VideoChat2 | 49.1 | 3.3 | MVBench: A Comprehensive Multi-modal Video Understanding Benchmark |
| LLaMA-VID-13B (2 Token) | 47.5 | 3.3 | LLaMA-VID: An Image is Worth 2 Tokens in Large Language Models |
| LLaMA-VID-7B (2 Token) | 47.4 | 3.3 | LLaMA-VID: An Image is Worth 2 Tokens in Large Language Models |
| Chat-UniVi-13B | 46.4 | 3.6 | Chat-UniVi: Unified Visual Representation Empowers Large Language Models with Image and Video Understanding |
| MiniGPT4-video-7B | 46.3 | - | MiniGPT4-Video: Advancing Multimodal LLMs for Video Understanding with Interleaved Visual-Textual Tokens |
| BT-Adapter (zero-shot) | 46.1 | 3.2 | BT-Adapter: Video Conversation is Feasible Without Video Instruction Tuning |
| Chat-UniVi | 46.1 | 3.3 | Chat-UniVi: Unified Visual Representation Empowers Large Language Models with Image and Video Understanding |
(Showing 20 of 28 results.)