Zero-Shot Video Question Answering on Video-MME

Metrics

Accuracy (%)
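
As a rough illustration of the metric, accuracy can be read as the percentage of questions answered correctly. The sketch below assumes multiple-choice answers scored by exact match on option letters. The function name `accuracy_percent` and the sample answers are hypothetical and are not taken from the benchmark's own evaluation code.

```python
def accuracy_percent(predictions, references):
    """Return the percentage of predictions that match the references.

    Assumption: answers are option letters compared case-insensitively
    after trimming whitespace. Real evaluation scripts may parse model
    output differently.
    """
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        raise ValueError("cannot compute accuracy on an empty set")
    correct = sum(
        p.strip().upper() == r.strip().upper()
        for p, r in zip(predictions, references)
    )
    return 100.0 * correct / len(references)


# Hypothetical option letters for four questions (illustrative only).
preds = ["A", "c", "B", "D"]
golds = ["A", "C", "D", "D"]
score = accuracy_percent(preds, golds)
```

With these made-up answers, three of the four predictions match, so `score` is a percentage on the same 0 to 100 scale as the Accuracy (%) column.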

Results

Performance results of various models on this benchmark

Model Name	Accuracy (%)	Paper Title	Repository
Gemini 1.5 Flash	66.3	Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context
GPT-4o mini	62.3	GPT-4o: Visual perception performance of multimodal large language models in piglet activity understanding	-
VideoLLaMA2 (72B)	60.9	VideoLLaMA 2: Advancing Spatial-Temporal Modeling and Audio Understanding in Video-LLMs
VILA-1.5 (34B)	61.4	VILA: On Pre-training for Visual Language Models
Video-RAG (based on LLaVA-Video)	77.4	Video-RAG: Visually-aligned Retrieval-Augmented Long Video Comprehension
LLaVA-OneVision (72B)	64.8	-	-
Gemini 1.5 Pro	71.9	Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context
GPT-4o	70.3	GPT-4o: Visual perception performance of multimodal large language models in piglet activity understanding	-
