Video Based Generative Performance

Metrics

Consistency

Contextual Understanding

Correctness of Information

Detail Orientation

Temporal Understanding

mean

Results

Performance results of various models on this benchmark

Model Name	Consistency	Contextual Understanding	Correctness of Information	Detail Orientation	Temporal Understanding	mean	Paper Title
VideoChat2_HD_mistral	2.84	3.72	3.40	2.91	2.65	3.10	MVBench: A Comprehensive Multi-modal Video Understanding Benchmark
BT-Adapter (zero-shot)	2.20	2.89	2.16	2.46	2.13	2.46	BT-Adapter: Video Conversation is Feasible Without Video Instruction Tuning
TS-LLaVA-34B	-	-	-	-	-	3.38	TS-LLaVA: Constructing Visual Tokens through Thumbnail-and-Sampling for Training-Free Video Large Language Models
LLaMA-VID-7B (2 Token)	2.51	3.53	2.96	3.00	2.46	2.89	LLaMA-VID: An Image is Worth 2 Tokens in Large Language Models
LLaMA-VID-13B (2 Token)	2.63	3.60	3.07	3.05	2.58	2.99	LLaMA-VID: An Image is Worth 2 Tokens in Large Language Models
LLaMA Adapter	2.15	2.30	2.03	2.32	1.98	2.16	LLaMA-Adapter V2: Parameter-Efficient Visual Instruction Model
BT-Adapter	2.46	3.27	2.68	2.69	2.34	2.69	BT-Adapter: Video Conversation is Feasible Without Video Instruction Tuning
VLM-RLAIF	3.32	4.00	3.63	3.25	3.23	3.49	Tuning Large Multimodal Models for Videos using Reinforcement Learning from AI Feedback
VideoChat2	2.81	3.51	3.02	2.88	2.66	2.98	MVBench: A Comprehensive Multi-modal Video Understanding Benchmark
CAT-7B	2.89	3.49	3.08	2.95	2.81	3.07	CAT: Enhancing Multimodal Large Language Model to Answer Questions in Dynamic Audio-Visual Scenarios
PPLLaVA-7B	3.20	3.88	3.32	3.20	3.00	3.32	PPLLaVA: Varied Video Sequence Understanding With Prompt Guidance
PLLaVA-34B	3.25	3.90	3.60	3.20	2.67	3.32	PLLaVA : Parameter-free LLaVA Extension from Images to Videos for Video Dense Captioning
VideoGPT+	3.39	3.74	3.27	3.18	2.83	3.28	VideoGPT+: Integrating Image and Video Encoders for Enhanced Video Understanding
Chat-UniVi	2.81	3.46	2.89	2.91	2.39	2.99	Chat-UniVi: Unified Visual Representation Empowers Large Language Models with Image and Video Understanding
LITA-13B	3.19	3.43	2.94	2.98	2.68	3.04	LITA: Language Instructed Temporal-Localization Assistant
PPLLaVA-7B-dpo	3.81	4.21	3.85	3.56	3.21	3.73	PPLLaVA: Varied Video Sequence Understanding With Prompt Guidance
Video Chat	2.24	2.53	2.23	2.50	1.94	2.29	VideoChat: Chat-Centric Video Understanding
SlowFast-LLaVA-34B	-	-	-	-	-	3.32	SlowFast-LLaVA: A Strong Training-Free Baseline for Video Large Language Models
ST-LLM-7B	2.81	3.74	3.23	3.05	2.93	3.15	ST-LLM: Large Language Models Are Effective Temporal Learners
Video LLaMA	1.79	2.16	1.96	2.18	1.82	1.98	Video-LLaMA: An Instruction-tuned Audio-Visual Language Model for Video Understanding
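
The page does not say how the mean column is derived. A minimal sketch, assuming it is the arithmetic average of the five per-category scores rounded to two decimals, checks that assumption against three rows copied from the table above. The helper name `recomputed_mean` is hypothetical and used only for illustration. Listed means may come directly from the source papers, so not every row need reproduce exactly.

```python
from statistics import mean

# Scores copied from three rows of the table above, in column order:
# Consistency, Contextual Understanding, Correctness of Information,
# Detail Orientation, Temporal Understanding. The second tuple element
# is the value listed in the "mean" column.
rows = {
    "VideoChat2_HD_mistral": ([2.84, 3.72, 3.40, 2.91, 2.65], 3.10),
    "LLaMA-VID-7B (2 Token)": ([2.51, 3.53, 2.96, 3.00, 2.46], 2.89),
    "Video LLaMA": ([1.79, 2.16, 1.96, 2.18, 1.82], 1.98),
}


def recomputed_mean(scores):
    """Assumed definition: arithmetic mean of the five scores, 2 decimals."""
    return round(mean(scores), 2)


for name, (scores, listed) in rows.items():
    print(f"{name}: recomputed={recomputed_mean(scores):.2f} listed={listed:.2f}")
```

Rows that report only a mean (TS-LLaVA-34B, SlowFast-LLaVA-34B) cannot be checked this way because their per-category scores are not listed.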
