Zero Shot Video Question Answer On Video Mme 1
Metrics
Accuracy (%)
Results
Performance results of various models on this benchmark
| Model | Accuracy (%) | Paper Title |
|---|---|---|
| Gemini 1.5 Pro | 81.3 | Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context |
| Video-RAG (Based on LLaVA-Video) | 77.4 | Video-RAG: Visually-aligned Retrieval-Augmented Long Video Comprehension |
| GPT-4o | 77.2 | GPT-4o: Visual perception performance of multimodal large language models in piglet activity understanding |
| Gemini 1.5 Flash | 75.0 | Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context |
| GPT-4o mini | 68.9 | GPT-4o: Visual perception performance of multimodal large language models in piglet activity understanding |
| BIMBA-LLaVA-Qwen2-7B | 64.67 | BIMBA: Selective-Scan Compression for Long-Range Video Question Answering |
| VILA-1.5 (34B) | 64.1 | VILA: On Pre-training for Visual Language Models |
| MiniCPM-V 2.6 (8B) | 63.7 | MiniCPM-V: A GPT-4V Level MLLM on Your Phone |
| VideoLLaMA2 (72B) | 63.1 | VideoLLaMA 2: Advancing Spatial-Temporal Modeling and Audio Understanding in Video-LLMs |
| LongVU (7B) | 60.6 | LongVU: Spatiotemporal Adaptive Compression for Long Video-Language Understanding |