HyperAI

Video Based Generative Performance

Metrics

Consistency
Contextual Understanding
Correctness of Information
Detail Orientation
Temporal Understanding
mean

Results

Performance results of various models on this benchmark

| Model name | Consistency | Contextual Understanding | Correctness of Information | Detail Orientation | Temporal Understanding | mean | Paper Title | Repository |
|---|---|---|---|---|---|---|---|---|
| VideoChat2_HD_mistral | 2.84 | 3.72 | 3.40 | 2.91 | 2.65 | 3.10 | MVBench: A Comprehensive Multi-modal Video Understanding Benchmark | |
| BT-Adapter (zero-shot) | 2.2 | 2.89 | 2.16 | 2.46 | 2.13 | 2.46 | BT-Adapter: Video Conversation is Feasible Without Video Instruction Tuning | |
| TS-LLaVA-34B | - | - | - | - | - | 3.38 | TS-LLaVA: Constructing Visual Tokens through Thumbnail-and-Sampling for Training-Free Video Large Language Models | |
| LLaMA-VID-7B (2 Token) | 2.51 | 3.53 | 2.96 | 3.00 | 2.46 | 2.89 | LLaMA-VID: An Image is Worth 2 Tokens in Large Language Models | |
| LLaMA-VID-13B (2 Token) | 2.63 | 3.60 | 3.07 | 3.05 | 2.58 | 2.99 | LLaMA-VID: An Image is Worth 2 Tokens in Large Language Models | |
| LLaMA Adapter | 2.15 | 2.30 | 2.03 | 2.32 | 1.98 | 2.16 | LLaMA-Adapter V2: Parameter-Efficient Visual Instruction Model | |
| BT-Adapter | 2.46 | 3.27 | 2.68 | 2.69 | 2.34 | 2.69 | BT-Adapter: Video Conversation is Feasible Without Video Instruction Tuning | |
| VLM-RLAIF | 3.32 | 4 | 3.63 | 3.25 | 3.23 | 3.49 | Tuning Large Multimodal Models for Videos using Reinforcement Learning from AI Feedback | - |
| VideoChat2 | 2.81 | 3.51 | 3.02 | 2.88 | 2.66 | 2.98 | MVBench: A Comprehensive Multi-modal Video Understanding Benchmark | |
| CAT-7B | 2.89 | 3.49 | 3.08 | 2.95 | 2.81 | 3.07 | CAT: Enhancing Multimodal Large Language Model to Answer Questions in Dynamic Audio-Visual Scenarios | |
| PPLLaVA-7B | 3.20 | 3.88 | 3.32 | 3.20 | 3.0 | 3.32 | PPLLaVA: Varied Video Sequence Understanding With Prompt Guidance | - |
| PLLaVA-34B | 3.25 | 3.90 | 3.60 | 3.20 | 2.67 | 3.32 | PLLaVA : Parameter-free LLaVA Extension from Images to Videos for Video Dense Captioning | |
| VideoGPT+ | 3.39 | 3.74 | 3.27 | 3.18 | 2.83 | 3.28 | VideoGPT+: Integrating Image and Video Encoders for Enhanced Video Understanding | |
| Chat-UniVi | 2.81 | 3.46 | 2.89 | 2.91 | 2.39 | 2.99 | Chat-UniVi: Unified Visual Representation Empowers Large Language Models with Image and Video Understanding | |
| LITA-13B | 3.19 | 3.43 | 2.94 | 2.98 | 2.68 | 3.04 | LITA: Language Instructed Temporal-Localization Assistant | |
| PPLLaVA-7B-dpo | 3.81 | 4.21 | 3.85 | 3.56 | 3.21 | 3.73 | PPLLaVA: Varied Video Sequence Understanding With Prompt Guidance | - |
| Video Chat | 2.24 | 2.53 | 2.23 | 2.50 | 1.94 | 2.29 | VideoChat: Chat-Centric Video Understanding | |
| SlowFast-LLaVA-34B | - | - | - | - | - | 3.32 | SlowFast-LLaVA: A Strong Training-Free Baseline for Video Large Language Models | |
| ST-LLM-7B | 2.81 | 3.74 | 3.23 | 3.05 | 2.93 | 3.15 | ST-LLM: Large Language Models Are Effective Temporal Learners | |
| Video LLaMA | 1.79 | 2.16 | 1.96 | 2.18 | 1.82 | 1.98 | Video-LLaMA: An Instruction-tuned Audio-Visual Language Model for Video Understanding | |