Video Question Answering on TVBench

Metrics

Average Accuracy

Results

Performance results of various models on this benchmark

Model Name	Average Accuracy	Paper Title	Repository
Aria	51.0	Aria: An Open Multimodal Native Mixture-of-Experts Model	-
PLLaVA-34B	42.3	PLLaVA : Parameter-free LLaVA Extension from Images to Videos for Video Dense Captioning	-
mPLUG-Owl3	42.2	mPLUG-Owl3: Towards Long Image-Sequence Understanding in Multi-Modal Large Language Models	-
Tarsier-7B	46.9	Tarsier: Recipes for Training and Evaluating Large Video Description Models	-
LLaVA-Video 7B	45.6	Video Instruction Tuning With Synthetic Data	-
PLLaVA-7B	34.9	PLLaVA : Parameter-free LLaVA Extension from Images to Videos for Video Dense Captioning	-
IXC-2.5 7B	51.6	InternLM-XComposer-2.5: A Versatile Large Vision Language Model Supporting Long-Contextual Input and Output	-
Qwen2-VL-72B	52.7	Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution	-
Qwen2-VL-7B	43.8	Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution	-
LLaVA-Video 72B	50.0	Video Instruction Tuning With Synthetic Data	-
ST-LLM	35.7	ST-LLM: Large Language Models Are Effective Temporal Learners	-
GPT-4o (8 frames)	39.9	GPT-4o System Card	-
VideoLLaMA2 72B	48.4	VideoLLaMA 2: Advancing Spatial-Temporal Modeling and Audio Understanding in Video-LLMs	-
PLLaVA-13B	36.4	PLLaVA : Parameter-free LLaVA Extension from Images to Videos for Video Dense Captioning	-
VideoGPT+	41.7	VideoGPT+: Integrating Image and Video Encoders for Enhanced Video Understanding	-
VideoLLaMA2 7B	42.9	VideoLLaMA 2: Advancing Spatial-Temporal Modeling and Audio Understanding in Video-LLMs	-
VideoLLaMA2.1	42.1	VideoLLaMA 2: Advancing Spatial-Temporal Modeling and Audio Understanding in Video-LLMs	-
Tarsier2-7B	54.7	Tarsier2: Advancing Large Vision-Language Models from Detailed Video Description to Comprehensive Video Understanding	-
VideoChat2	35.0	MVBench: A Comprehensive Multi-modal Video Understanding Benchmark	-
Gemini 1.5 Pro	47.6	Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context	-
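
The rows above are not listed in score order. As a minimal sketch, assuming the table is available as tab-separated text, the snippet below parses it and ranks models by Average Accuracy. The `rank_models` helper and the `ROWS` variable are hypothetical names introduced for illustration, and only a handful of rows from the table are included to keep the example short.

```python
import csv
import io

# A few rows copied from the table above (tab-separated).
# Extend with the remaining rows as needed.
ROWS = """Model Name\tAverage Accuracy
Aria\t51.0
Tarsier-7B\t46.9
Qwen2-VL-72B\t52.7
Tarsier2-7B\t54.7
PLLaVA-7B\t34.9
"""


def rank_models(tsv_text):
    """Return (model, accuracy) pairs sorted from highest to lowest accuracy."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    scored = [(row["Model Name"], float(row["Average Accuracy"])) for row in reader]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


ranked = rank_models(ROWS)
for position, (model, accuracy) in enumerate(ranked, start=1):
    print(f"{position}. {model}: {accuracy:.1f}")
```

Sorting on the parsed float rather than the raw string avoids lexicographic ordering mistakes if a score ever has a different number of digits.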
