HyperAI
Video Question Answering on NExT-QA
Metrics
Accuracy
Results
Performance results of different models on this benchmark
| Model Name | Accuracy (%) | Paper Title | Repository |
|---|---|---|---|
| LLaVA-Video | 83.2 | Video Instruction Tuning With Synthetic Data | - |
| LLaVA-NeXT-Interleave(14B) | 79.1 | LLaVA-NeXT-Interleave: Tackling Multi-image, Video, and 3D in Large Multimodal Models | - |
| ATM | 58.3 | ATM: Action Temporality Modeling for Video Question Answering | - |
| VideoChat2_HD_mistral | 79.5 | MVBench: A Comprehensive Multi-modal Video Understanding Benchmark | - |
| ViperGPT(0-shot) | 60.0 | ViperGPT: Visual Inference via Python Execution for Reasoning | - |
| LongVILA(7B) | 80.7 | LongVILA: Scaling Long-Context Visual Language Models for Long Videos | - |
| VGT(PT) | 56.9 | Video Graph Transformer for Video Question Answering | - |
| TCR | 73.5 | Text-Conditioned Resampler For Long Form Video Understanding | - |
| ViLA (3B) | 75.6 | ViLA: Efficient Video-Language Alignment for Video Question Answering | - |
| HiTeA | 63.1 | HiTeA: Hierarchical Temporal-Aware Video-Language Pre-training | - |
| HQGA | 51.4 | Video as Conditional Graph Hierarchy for Multi-Granular Question Answering | - |
| RTQ | 63.2 | RTQ: Rethinking Video-language Understanding Based on Image-text Model | - |
| GF | 58.83 | Glance and Focus: Memory Prompting for Multi-Event Video Question Answering | - |
| LSTP | 72.1 | Efficient Temporal Extrapolation of Multimodal Large Language Models with Temporal Grounding Bridge | - |
| LLaMA-VQA (33B) | 75.5 | Large Language Models are Temporal and Causal Reasoners for Video Question Answering | - |
| CoVGT(PT) | 60.7 | Contrastive Video Question Answering via Video Graph Transformer | - |
| SeViT | 60.6 | Semi-Parametric Video-Grounded Text Generation | - |
| VideoChat2_mistral | 78.6 | MVBench: A Comprehensive Multi-modal Video Understanding Benchmark | - |
| Vamos | 77.3 | Vamos: Versatile Action Models for Video Understanding | - |
| LinVT-Qwen2-VL (7B) | 85.5 | LinVT: Empower Your Image-level Large Language Model to Understand Videos | - |
This page lists 20 of the leaderboard's 44 rows.