Zero-Shot Video Question Answering on MSRVTT-QA

Metrics

Accuracy

Confidence Score

Results

Performance results of each model on this benchmark

Model	Accuracy	Confidence Score	Paper Title	Repository
Chat-UniVi-7B	55.0	3.1	Chat-UniVi: Unified Visual Representation Empowers Large Language Models with Image and Video Understanding
TS-LLaVA-34B	66.2	3.6	TS-LLaVA: Constructing Visual Tokens through Thumbnail-and-Sampling for Training-Free Video Large Language Models
BT-Adapter (zero-shot)	51.2	2.9	BT-Adapter: Video Conversation is Feasible Without Video Instruction Tuning
Video-LLaVA-7B	59.2	3.5	Video-LLaVA: Learning United Visual Representation by Alignment Before Projection
LLaMA-VID-7B (2 Token)	57.7	3.2	LLaMA-VID: An Image is Worth 2 Tokens in Large Language Models
IG-VLM	63.8	3.5	An Image Grid Can Be Worth a Video: Zero-shot Video Question Answering Using a VLM
Omni-VideoAssistant	55.3	3.3	OmniDataComposer: A Unified Data Structure for Multimodal Data Fusion and Infinite Data Generation
Elysium	67.5	3.2	Elysium: Exploring Object-level Perception in Videos via MLLM
MovieChat	52.7	2.6	MovieChat: From Dense Token to Sparse Memory for Long Video Understanding
SUM-shot+Vicuna	56.8	-	Shot2Story: A New Benchmark for Comprehensive Understanding of Multi-shot Videos
CAT-7B	62.1	3.5	CAT: Enhancing Multimodal Large Language Model to Answer Questions in Dynamic Audio-Visual Scenarios
VideoChat2	54.1	3.3	MVBench: A Comprehensive Multi-modal Video Understanding Benchmark
Vista-LLaMA-7B	60.5	3.3	Vista-LLaMA: Reducing Hallucination in Video Language Models via Equal Distance to Visual Tokens
Tarsier (34B)	66.4	3.7	Tarsier: Recipes for Training and Evaluating Large Video Description Models
Video-LaVIT	59.3	3.3	Video-LaVIT: Unified Video-Language Pre-training with Decoupled Visual-Motional Tokenization
Video Chat-7B	45.0	2.5	VideoChat: Chat-Centric Video Understanding
VideoGPT+	60.6	3.6	VideoGPT+: Integrating Image and Video Encoders for Enhanced Video Understanding
Video-ChatGPT-7B	49.3	2.8	Video-ChatGPT: Towards Detailed Video Understanding via Large Vision and Language Models
PLLaVA (34B)	68.7	3.6	PLLaVA : Parameter-free LLaVA Extension from Images to Videos for Video Dense Captioning
