Zero-Shot Video Question Answering on Video-MME

Metrics

Accuracy (%)
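As a minimal sketch of how an accuracy percentage like the ones reported here can be computed, assume each question has a single reference answer letter and each model output has already been reduced to one predicted letter. The function name `accuracy_percent` and the sample data are hypothetical illustrations, not part of the benchmark's official evaluation code.

```python
def accuracy_percent(predictions, answers):
    """Return the percentage of predictions that match the reference answers."""
    if len(predictions) != len(answers):
        raise ValueError("predictions and answers must have the same length")
    if not answers:
        raise ValueError("cannot compute accuracy on an empty set")
    # Compare case-insensitively after trimming whitespace (an assumed normalization).
    correct = sum(
        p.strip().upper() == a.strip().upper()
        for p, a in zip(predictions, answers)
    )
    return 100.0 * correct / len(answers)


# Hypothetical example: four questions, three answered correctly.
predicted = ["A", "c", "B", "D"]
reference = ["A", "C", "D", "D"]
print(f"{accuracy_percent(predicted, reference):.1f}")  # 75.0
```

Real evaluation pipelines usually also need a step that extracts the chosen option from free-form model output; that step is omitted here.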

Results

Performance results of various models on this benchmark

Model	Accuracy (%)	Paper Title
Gemini 1.5 Pro	81.3	Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context
Video-RAG (Based on LLaVA-Video)	77.4	Video-RAG: Visually-aligned Retrieval-Augmented Long Video Comprehension
GPT-4o	77.2	GPT-4o: Visual perception performance of multimodal large language models in piglet activity understanding
Gemini 1.5 Flash	75.0	Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context
GPT-4o mini	68.9	GPT-4o: Visual perception performance of multimodal large language models in piglet activity understanding
BIMBA-LLaVA-Qwen2-7B	64.67	BIMBA: Selective-Scan Compression for Long-Range Video Question Answering
VILA-1.5 (34B)	64.1	VILA: On Pre-training for Visual Language Models
MiniCPM-V 2.6 (8B)	63.7	MiniCPM-V: A GPT-4V Level MLLM on Your Phone
VideoLLaMA2 (72B)	63.1	VideoLLaMA 2: Advancing Spatial-Temporal Modeling and Audio Understanding in Video-LLMs
LongVU (7B)	60.6	LongVU: Spatiotemporal Adaptive Compression for Long Video-Language Understanding
