Zero-Shot Video Question Answering on EgoSchema

Metrics

Accuracy
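
As a minimal sketch, assuming accuracy here means the percentage of multiple-choice questions answered correctly (which is consistent with the Random baseline of 20.0, i.e. chance over five answer options), the metric could be computed roughly as follows. The function name and the sample data are hypothetical and not taken from any benchmark toolkit.

```python
def accuracy(predictions, answers):
    """Return the percentage of predictions that match the ground-truth answers."""
    if len(predictions) != len(answers):
        raise ValueError("predictions and answers must have the same length")
    if not answers:
        raise ValueError("at least one question is required")
    # Count exact matches between predicted and correct option indices.
    correct = sum(p == a for p, a in zip(predictions, answers))
    return 100.0 * correct / len(answers)


# Hypothetical example: option indices (0-4) chosen for five questions.
predicted = [0, 3, 2, 4, 1]
ground_truth = [0, 3, 1, 4, 2]
score = accuracy(predicted, ground_truth)
print(f"{score:.1f}")
```

Reporting the score as a percentage keeps it on the same scale as the values listed in the results table.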

Results

Performance of each model on this benchmark

Model	Accuracy (%)	Paper Title	Repository
TimeChat (7B)	33.0	TimeChat: A Time-sensitive Multimodal Large Language Model for Long Video Understanding
VideoChat2_mistral	54.4	MVBench: A Comprehensive Multi-modal Video Understanding Benchmark
Tarsier (34B)	61.7	Tarsier: Recipes for Training and Evaluating Large Video Description Models
LLoVi (GPT-3.5)	50.3	A Simple LLM Framework for Long-Range Video Question-Answering
InternVideo	32.1	InternVideo: General Video Foundation Models via Generative and Discriminative Learning
VideoTree (GPT4)	61.1	VideoTree: Adaptive Tree-based Video Representation for LLM Reasoning on Long Videos
Vamos (GPT-4o)	53.6	Vamos: Versatile Action Models for Video Understanding
MVU (13B)	37.6	Understanding Long Videos with Multimodal Language Models
SeViLA (4B)	22.7	Self-Chained Image-Language Model for Video Localization and Question Answering
LVNet	61.1	Too Many Frames, Not All Useful: Efficient Strategies for Long-Form Video QA
BIMBA-LLaVA-Qwen2-7B	71.14	BIMBA: Selective-Scan Compression for Long-Range Video Question Answering
Random	20.0	-	-
Vamos (GPT-4)	48.3	Vamos: Versatile Action Models for Video Understanding
VideoChat2_HD_mistral	55.8	MVBench: A Comprehensive Multi-modal Video Understanding Benchmark
VideoLLaMA2 (72B)	63.9	VideoLLaMA 2: Advancing Spatial-Temporal Modeling and Audio Understanding in Video-LLMs
Video-RAG (Based on LLaVA-Video)	66.7	Video-RAG: Visually-aligned Retrieval-Augmented Long Video Comprehension
mPLUG-Owl (7B)	31.1	mPLUG-Owl: Modularization Empowers Large Language Models with Multimodality
LinVT-Qwen2-VL(7B)	69.5	LinVT: Empower Your Image-level Large Language Model to Understand Videos
TraveLER	53.3	TraveLER: A Modular Multi-LMM Agent Framework for Video Question-Answering
VideoChat2_phi3	56.7	MVBench: A Comprehensive Multi-modal Video Understanding Benchmark
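
The rows above are listed in no particular order. For readers who want to rank them, here is a minimal sketch, assuming the rows are available as tab-separated text. Only a handful of rows from the table are reproduced, and the helper name `rank_by_accuracy` is hypothetical.

```python
import csv
import io

# A few rows copied from the table above (model, accuracy, paper title).
ROWS = "\n".join([
    "Tarsier (34B)\t61.7\tTarsier: Recipes for Training and Evaluating Large Video Description Models",
    "SeViLA (4B)\t22.7\tSelf-Chained Image-Language Model for Video Localization and Question Answering",
    "BIMBA-LLaVA-Qwen2-7B\t71.14\tBIMBA: Selective-Scan Compression for Long-Range Video Question Answering",
    "Random\t20.0\t-\t-",
    "LinVT-Qwen2-VL(7B)\t69.5\tLinVT: Empower Your Image-level Large Language Model to Understand Videos",
])


def rank_by_accuracy(text):
    """Parse tab-separated rows and return (model, accuracy) pairs, best first."""
    reader = csv.reader(io.StringIO(text), delimiter="\t", quoting=csv.QUOTE_NONE)
    # Skip blank or malformed lines that lack an accuracy column.
    rows = [(fields[0], float(fields[1])) for fields in reader if len(fields) >= 2]
    return sorted(rows, key=lambda row: row[1], reverse=True)


ranking = rank_by_accuracy(ROWS)
for name, acc in ranking:
    print(f"{name}\t{acc}")
```

Splitting on tabs rather than commas matters here because some paper titles contain commas.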
