HyperAI

Video Question Answering on TVBench

Metrics

Average Accuracy

Results

Performance results of various models on this benchmark

| Model Name | Average Accuracy (%) | Paper Title | Repository |
| --- | --- | --- | --- |
| Aria | 51.0 | Aria: An Open Multimodal Native Mixture-of-Experts Model | - |
| PLLaVA-34B | 42.3 | PLLaVA : Parameter-free LLaVA Extension from Images to Videos for Video Dense Captioning | - |
| mPLUG-Owl3 | 42.2 | mPLUG-Owl3: Towards Long Image-Sequence Understanding in Multi-Modal Large Language Models | - |
| Tarsier-7B | 46.9 | Tarsier: Recipes for Training and Evaluating Large Video Description Models | - |
| LLaVA-Video 7B | 45.6 | Video Instruction Tuning With Synthetic Data | - |
| PLLaVA-7B | 34.9 | PLLaVA : Parameter-free LLaVA Extension from Images to Videos for Video Dense Captioning | - |
| IXC-2.5 7B | 51.6 | InternLM-XComposer-2.5: A Versatile Large Vision Language Model Supporting Long-Contextual Input and Output | - |
| Qwen2-VL-72B | 52.7 | Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution | - |
| Qwen2-VL-7B | 43.8 | Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution | - |
| LLaVA-Video 72B | 50.0 | Video Instruction Tuning With Synthetic Data | - |
| ST-LLM | 35.7 | ST-LLM: Large Language Models Are Effective Temporal Learners | - |
| GPT4o 8 frames | 39.9 | GPT-4o System Card | - |
| VideoLLaMA2 72B | 48.4 | VideoLLaMA 2: Advancing Spatial-Temporal Modeling and Audio Understanding in Video-LLMs | - |
| PLLaVA-13B | 36.4 | PLLaVA : Parameter-free LLaVA Extension from Images to Videos for Video Dense Captioning | - |
| VideoGPT+ | 41.7 | VideoGPT+: Integrating Image and Video Encoders for Enhanced Video Understanding | - |
| VideoLLaMA2 7B | 42.9 | VideoLLaMA 2: Advancing Spatial-Temporal Modeling and Audio Understanding in Video-LLMs | - |
| VideoLLaMA2.1 | 42.1 | VideoLLaMA 2: Advancing Spatial-Temporal Modeling and Audio Understanding in Video-LLMs | - |
| Tarsier2-7B | 54.7 | Tarsier2: Advancing Large Vision-Language Models from Detailed Video Description to Comprehensive Video Understanding | - |
| VideoChat2 | 35.0 | MVBench: A Comprehensive Multi-modal Video Understanding Benchmark | - |
| Gemini 1.5 Pro | 47.6 | Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context | - |