HyperAI

Video Based Generative Performance

Metrics

Consistency
Contextual Understanding
Correctness of Information
Detail Orientation
Temporal Understanding
mean

Results

Performance results of various models on this benchmark

Comparison table
| Model name | Consistency | Contextual Understanding | Correctness of Information | Detail Orientation | Temporal Understanding | mean |
| --- | --- | --- | --- | --- | --- | --- |
| mvbench-a-comprehensive-multi-modal-video | 2.84 | 3.72 | 3.40 | 2.91 | 2.65 | 3.10 |
| one-for-all-video-conversation-is-feasible | 2.2 | 2.89 | 2.16 | 2.46 | 2.13 | 2.46 |
| ts-llava-constructing-visual-tokens-through | - | - | - | - | - | 3.38 |
| llama-vid-an-image-is-worth-2-tokens-in-large | 2.51 | 3.53 | 2.96 | 3.00 | 2.46 | 2.89 |
| llama-vid-an-image-is-worth-2-tokens-in-large | 2.63 | 3.60 | 3.07 | 3.05 | 2.58 | 2.99 |
| llama-adapter-v2-parameter-efficient-visual | 2.15 | 2.30 | 2.03 | 2.32 | 1.98 | 2.16 |
| one-for-all-video-conversation-is-feasible | 2.46 | 3.27 | 2.68 | 2.69 | 2.34 | 2.69 |
| tuning-large-multimodal-models-for-videos | 3.32 | 4 | 3.63 | 3.25 | 3.23 | 3.49 |
| mvbench-a-comprehensive-multi-modal-video | 2.81 | 3.51 | 3.02 | 2.88 | 2.66 | 2.98 |
| cat-enhancing-multimodal-large-language-model | 2.89 | 3.49 | 3.08 | 2.95 | 2.81 | 3.07 |
| ppllava-varied-video-sequence-understanding | 3.20 | 3.88 | 3.32 | 3.20 | 3.0 | 3.32 |
| pllava-parameter-free-llava-extension-from-1 | 3.25 | 3.90 | 3.60 | 3.20 | 2.67 | 3.32 |
| videogpt-integrating-image-and-video-encoders | 3.39 | 3.74 | 3.27 | 3.18 | 2.83 | 3.28 |
| chat-univi-unified-visual-representation | 2.81 | 3.46 | 2.89 | 2.91 | 2.39 | 2.99 |
| lita-language-instructed-temporal | 3.19 | 3.43 | 2.94 | 2.98 | 2.68 | 3.04 |
| ppllava-varied-video-sequence-understanding | 3.81 | 4.21 | 3.85 | 3.56 | 3.21 | 3.73 |
| videochat-chat-centric-video-understanding | 2.24 | 2.53 | 2.23 | 2.50 | 1.94 | 2.29 |
| slowfast-llava-a-strong-training-free | - | - | - | - | - | 3.32 |
| st-llm-large-language-models-are-effective-1 | 2.81 | 3.74 | 3.23 | 3.05 | 2.93 | 3.15 |
| video-llama-an-instruction-tuned-audio-visual | 1.79 | 2.16 | 1.96 | 2.18 | 1.82 | 1.98 |
| vtimellm-empower-llm-to-grasp-video-moments | 2.47 | 3.40 | 2.78 | 3.10 | 2.49 | 2.85 |
| video-chatgpt-towards-detailed-video | 2.37 | 2.62 | 2.4 | 2.52 | 1.98 | 2.38 |
| an-image-grid-can-be-worth-a-video-zero-shot | 3.13 | 3.61 | 3.40 | 2.80 | 2.89 | 3.17 |
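
The page does not say how the mean column is computed. For rows that report all five metrics, it appears consistent with a plain arithmetic average of those scores. A minimal Python sketch of that check, using the video-llama-an-instruction-tuned-audio-visual row, is given here. The averaging rule and the helper name `mean_score` are assumptions made for illustration, not something stated by the benchmark.

```python
from statistics import mean

# Per-metric scores copied from the
# video-llama-an-instruction-tuned-audio-visual row.
# ASSUMPTION: "mean" is the unweighted average of the five metrics;
# the page itself does not document how the column is derived.
video_llama = {
    "Consistency": 1.79,
    "Contextual Understanding": 2.16,
    "Correctness of Information": 1.96,
    "Detail Orientation": 2.18,
    "Temporal Understanding": 1.82,
}


def mean_score(scores):
    """Return the arithmetic mean of the metric scores, rounded to 2 decimals."""
    return round(mean(scores.values()), 2)


if __name__ == "__main__":
    # Compare the recomputed value with the "mean" listed for this row.
    print(mean_score(video_llama))
```

Rows that list only a mean (shown with "-" in the other columns) cannot be checked this way. A listed mean that differs from the recomputed value may reflect rounding or a different aggregation on the source side, so this is best treated as a sanity check rather than the benchmark's definition.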