Zero-Shot Video Question Answering on Video-MME

Metrics

Accuracy (%)

Results

Performance results of various models on this benchmark

Model	Accuracy (%)	Paper Title
Video-RAG (based on LLaVA-Video)	77.4	Video-RAG: Visually-aligned Retrieval-Augmented Long Video Comprehension
Gemini 1.5 Pro	71.9	Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context
GPT-4o	70.3	GPT-4o: Visual perception performance of multimodal large language models in piglet activity understanding
Gemini 1.5 Flash	66.3	Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context
LLaVA-OneVision (72B)	64.8	-
GPT-4o mini	62.3	GPT-4o: Visual perception performance of multimodal large language models in piglet activity understanding
VILA-1.5 (34B)	61.4	VILA: On Pre-training for Visual Language Models
VideoLLaMA2 (72B)	60.9	VideoLLaMA 2: Advancing Spatial-Temporal Modeling and Audio Understanding in Video-LLMs
