Zero-Shot Video Question Answer on Video-MME 1

Metrics

Accuracy (%)

Results

Performance results of various models on this benchmark

Model Name	Accuracy (%)	Paper Title	Repository
GPT-4o mini	68.9	GPT-4o: Visual perception performance of multimodal large language models in piglet activity understanding	-
VideoLLaMA2 (72B)	63.1	VideoLLaMA 2: Advancing Spatial-Temporal Modeling and Audio Understanding in Video-LLMs
BIMBA-LLaVA-Qwen2-7B	64.67	BIMBA: Selective-Scan Compression for Long-Range Video Question Answering
Video-RAG (Based on LLaVA-Video)	77.4	Video-RAG: Visually-aligned Retrieval-Augmented Long Video Comprehension
VILA-1.5 (34B)	64.1	VILA: On Pre-training for Visual Language Models
Gemini 1.5 Pro	81.3	Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context
LongVU (7B)	60.6	LongVU: Spatiotemporal Adaptive Compression for Long Video-Language Understanding
MiniCPM-V 2.6 (8B)	63.7	MiniCPM-V: A GPT-4V Level MLLM on Your Phone
Gemini 1.5 Flash	75.0	Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context
GPT-4o	77.2	GPT-4o: Visual perception performance of multimodal large language models in piglet activity understanding	-
