Zero-Shot Video Question Answering on Video-MME
Metrics
Accuracy (%)
Results
Performance results of various models on this benchmark
Model Name | Accuracy (%) | Paper Title | Repository |
---|---|---|---|
Gemini 1.5 Flash | 66.3 | Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context | |
GPT-4o mini | 62.3 | GPT-4o: Visual perception performance of multimodal large language models in piglet activity understanding | - |
VideoLLaMA2 (72B) | 60.9 | VideoLLaMA 2: Advancing Spatial-Temporal Modeling and Audio Understanding in Video-LLMs | |
VILA-1.5 (34B) | 61.4 | VILA: On Pre-training for Visual Language Models | |
Video-RAG (based on LLaVA-Video) | 77.4 | Video-RAG: Visually-aligned Retrieval-Augmented Long Video Comprehension | |
LLaVA-OneVision (72B) | 64.8 | - | - |
Gemini 1.5 Pro | 71.9 | Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context | |
GPT-4o | 70.3 | GPT-4o: Visual perception performance of multimodal large language models in piglet activity understanding | - |