Zero-Shot Video Question Answering on NExT-QA

Metrics

Accuracy
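
As a minimal sketch of how such a score could be computed, assuming the multiple-choice setting in which each question has one correct option and accuracy is the percentage of questions answered correctly. The function name `accuracy` and the sample data are hypothetical and are not taken from any listed paper's evaluation code.

```python
def accuracy(predictions, answers):
    """Return the percentage of questions whose predicted option matches the answer."""
    if len(predictions) != len(answers):
        raise ValueError("predictions and answers must have the same length")
    if not answers:
        raise ValueError("at least one question is required")
    # Count exact matches between predicted and ground-truth option indices.
    correct = sum(p == a for p, a in zip(predictions, answers))
    return 100.0 * correct / len(answers)


# Hypothetical example: chosen option index for four questions.
predicted = [0, 3, 2, 4]
ground_truth = [0, 1, 2, 4]
print(round(accuracy(predicted, ground_truth), 1))  # 3 of 4 correct -> 75.0
```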

Results

Performance of each model on this benchmark

Model	Accuracy (%)	Paper Title	Repository
VideoChat2	61.7	MVBench: A Comprehensive Multi-modal Video Understanding Benchmark
VidCtx (7B)	70.7	VidCtx: Context-aware Video Question Answering with Image Models
LongVA(32 frames)	67.1	Long Context Transfer from Language to Vision
IG-VLM(LLaVA v1.6)	70.9	An Image Grid Can Be Worth a Video: Zero-shot Video Question Answering Using a VLM
MVU (13B)	55.2	Understanding Long Videos with Multimodal Language Models
Q-ViD	66.3	Question-Instructed Visual Descriptions for Zero-Shot Video Question Answering
VideoTree (GPT4)	73.5	VideoTree: Adaptive Tree-based Video Representation for LLM Reasoning on Long Videos
SeViLA (4B)	63.6	Self-Chained Image-Language Model for Video Localization and Question Answering
TraveLER (GPT-4)	68.2	TraveLER: A Modular Multi-LMM Agent Framework for Video Question-Answering
ProViQ	64.6	Zero-Shot Video Question Answering with Procedural Programs	-
Tarsier (34B)	79.2	Tarsier: Recipes for Training and Evaluating Large Video Description Models
ViperGPT (GPT-3.5)	60.0	ViperGPT: Visual Inference via Python Execution for Reasoning
LLoVi (GPT-4)	67.7	A Simple LLM Framework for Long-Range Video Question-Answering
MoReVQA(PaLM-2)	69.2	MoReVQA: Exploring Modular Reasoning Models for Video Question Answering	-
IG-VLM (GPT-4)	68.6	An Image Grid Can Be Worth a Video: Zero-shot Video Question Answering Using a VLM
DeepStack-L(7B)	61.0	DeepStack: Deeply Stacking Visual Tokens is Surprisingly Simple and Effective for LMMs	-
VideoAgent (GPT-4)	71.3	VideoAgent: Long-form Video Understanding with Large Language Model as Agent
LVNet(GPT-4o)	72.9	Too Many Frames, Not All Useful: Efficient Strategies for Long-Form Video QA
LLoVi (7B)	54.3	A Simple LLM Framework for Long-Range Video Question-Answering
TS-LLaVA-34B	73.6	TS-LLaVA: Constructing Visual Tokens through Thumbnail-and-Sampling for Training-Free Video Large Language Models
