HyperAI

Zero Shot Video Question Answer On Egoschema 1

Metrics

Accuracy

Results

Performance results of various models on this benchmark

| Model Name | Accuracy (%) | Paper Title | Repository |
| --- | --- | --- | --- |
| TimeChat (7B) | 33.0 | TimeChat: A Time-sensitive Multimodal Large Language Model for Long Video Understanding | |
| VideoChat2_mistral | 54.4 | MVBench: A Comprehensive Multi-modal Video Understanding Benchmark | |
| Tarsier (34B) | 61.7 | Tarsier: Recipes for Training and Evaluating Large Video Description Models | |
| LLoVi (GPT-3.5) | 50.3 | A Simple LLM Framework for Long-Range Video Question-Answering | |
| InternVideo | 32.1 | InternVideo: General Video Foundation Models via Generative and Discriminative Learning | |
| VideoTree (GPT4) | 61.1 | VideoTree: Adaptive Tree-based Video Representation for LLM Reasoning on Long Videos | |
| Vamos (GPT-4o) | 53.6 | Vamos: Versatile Action Models for Video Understanding | - |
| MVU (13B) | 37.6 | Understanding Long Videos with Multimodal Language Models | |
| SeViLA (4B) | 22.7 | Self-Chained Image-Language Model for Video Localization and Question Answering | |
| LVNet | 61.1 | Too Many Frames, Not All Useful: Efficient Strategies for Long-Form Video QA | |
| BIMBA-LLaVA-Qwen2-7B | 71.14 | BIMBA: Selective-Scan Compression for Long-Range Video Question Answering | - |
| Random | 20.0 | - | - |
| Vamos (GPT-4) | 48.3 | Vamos: Versatile Action Models for Video Understanding | - |
| VideoChat2_HD_mistral | 55.8 | MVBench: A Comprehensive Multi-modal Video Understanding Benchmark | |
| VideoLLaMA2 (72B) | 63.9 | VideoLLaMA 2: Advancing Spatial-Temporal Modeling and Audio Understanding in Video-LLMs | |
| Video-RAG (Based on LLaVA-Video) | 66.7 | Video-RAG: Visually-aligned Retrieval-Augmented Long Video Comprehension | |
| mPLUG-Owl (7B) | 31.1 | mPLUG-Owl: Modularization Empowers Large Language Models with Multimodality | |
| LinVT-Qwen2-VL(7B) | 69.5 | LinVT: Empower Your Image-level Large Language Model to Understand Videos | |
| TraveLER | 53.3 | TraveLER: A Modular Multi-LMM Agent Framework for Video Question-Answering | |
| VideoChat2_phi3 | 56.7 | MVBench: A Comprehensive Multi-modal Video Understanding Benchmark | |