Video Based Generative Performance

Metrics

Consistency
Contextual Understanding
Correctness of Information
Detail Orientation
Temporal Understanding
mean

Results

Performance results of various models on this benchmark

| Model name | Consistency | Contextual Understanding | Correctness of Information | Detail Orientation | Temporal Understanding | mean | Paper Title | Repository |
|---|---|---|---|---|---|---|---|---|
| VideoChat2_HD_mistral | 2.84 | 3.72 | 3.40 | 2.91 | 2.65 | 3.10 | MVBench: A Comprehensive Multi-modal Video Understanding Benchmark | - |
| BT-Adapter (zero-shot) | 2.2 | 2.89 | 2.16 | 2.46 | 2.13 | 2.46 | BT-Adapter: Video Conversation is Feasible Without Video Instruction Tuning | - |
| TS-LLaVA-34B | - | - | - | - | - | 3.38 | TS-LLaVA: Constructing Visual Tokens through Thumbnail-and-Sampling for Training-Free Video Large Language Models | - |
| LLaMA-VID-7B (2 Token) | 2.51 | 3.53 | 2.96 | 3.00 | 2.46 | 2.89 | LLaMA-VID: An Image is Worth 2 Tokens in Large Language Models | - |
| LLaMA-VID-13B (2 Token) | 2.63 | 3.60 | 3.07 | 3.05 | 2.58 | 2.99 | LLaMA-VID: An Image is Worth 2 Tokens in Large Language Models | - |
| LLaMA Adapter | 2.15 | 2.30 | 2.03 | 2.32 | 1.98 | 2.16 | LLaMA-Adapter V2: Parameter-Efficient Visual Instruction Model | - |
| BT-Adapter | 2.46 | 3.27 | 2.68 | 2.69 | 2.34 | 2.69 | BT-Adapter: Video Conversation is Feasible Without Video Instruction Tuning | - |
| VLM-RLAIF | 3.32 | 4 | 3.63 | 3.25 | 3.23 | 3.49 | Tuning Large Multimodal Models for Videos using Reinforcement Learning from AI Feedback | - |
| VideoChat2 | 2.81 | 3.51 | 3.02 | 2.88 | 2.66 | 2.98 | MVBench: A Comprehensive Multi-modal Video Understanding Benchmark | - |
| CAT-7B | 2.89 | 3.49 | 3.08 | 2.95 | 2.81 | 3.07 | CAT: Enhancing Multimodal Large Language Model to Answer Questions in Dynamic Audio-Visual Scenarios | - |
| PPLLaVA-7B | 3.20 | 3.88 | 3.32 | 3.20 | 3.0 | 3.32 | PPLLaVA: Varied Video Sequence Understanding With Prompt Guidance | - |
| PLLaVA-34B | 3.25 | 3.90 | 3.60 | 3.20 | 2.67 | 3.32 | PLLaVA : Parameter-free LLaVA Extension from Images to Videos for Video Dense Captioning | - |
| VideoGPT+ | 3.39 | 3.74 | 3.27 | 3.18 | 2.83 | 3.28 | VideoGPT+: Integrating Image and Video Encoders for Enhanced Video Understanding | - |
| Chat-UniVi | 2.81 | 3.46 | 2.89 | 2.91 | 2.39 | 2.99 | Chat-UniVi: Unified Visual Representation Empowers Large Language Models with Image and Video Understanding | - |
| LITA-13B | 3.19 | 3.43 | 2.94 | 2.98 | 2.68 | 3.04 | LITA: Language Instructed Temporal-Localization Assistant | - |
| PPLLaVA-7B-dpo | 3.81 | 4.21 | 3.85 | 3.56 | 3.21 | 3.73 | PPLLaVA: Varied Video Sequence Understanding With Prompt Guidance | - |
| Video Chat | 2.24 | 2.53 | 2.23 | 2.50 | 1.94 | 2.29 | VideoChat: Chat-Centric Video Understanding | - |
| SlowFast-LLaVA-34B | - | - | - | - | - | 3.32 | SlowFast-LLaVA: A Strong Training-Free Baseline for Video Large Language Models | - |
| ST-LLM-7B | 2.81 | 3.74 | 3.23 | 3.05 | 2.93 | 3.15 | ST-LLM: Large Language Models Are Effective Temporal Learners | - |
| Video LLaMA | 1.79 | 2.16 | 1.96 | 2.18 | 1.82 | 1.98 | Video-LLaMA: An Instruction-tuned Audio-Visual Language Model for Video Understanding | - |
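
The mean column appears to be the unweighted average of the five per-metric scores. The page does not state this; it is an assumption inferred from the listed values. A minimal Python sketch that checks it for a single row (the names `METRICS` and `mean_score` are hypothetical, introduced only for illustration):

```python
from statistics import fmean

# The five metrics listed on the page, in column order.
METRICS = (
    "Consistency",
    "Contextual Understanding",
    "Correctness of Information",
    "Detail Orientation",
    "Temporal Understanding",
)


def mean_score(scores: dict[str, float]) -> float:
    """Unweighted average of the five metric scores, rounded to two decimals.

    Assumption: the leaderboard's "mean" is a plain arithmetic mean.
    """
    return round(fmean(scores[m] for m in METRICS), 2)


# Scores copied from the VideoChat2_HD_mistral row.
videochat2_hd = dict(zip(METRICS, (2.84, 3.72, 3.40, 2.91, 2.65)))
print(mean_score(videochat2_hd))
```

For this row the result agrees with the listed mean of 3.10. Not every row reproduces exactly (the BT-Adapter (zero-shot) scores average to 2.37 against a listed 2.46), so this is best treated as a sanity check rather than the benchmark's official aggregation.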