Video Based Generative Performance
Metrics
Consistency
Contextual Understanding
Correctness of Information
Detail Orientation
Temporal Understanding
mean
Results
Performance results of various models on this benchmark
Comparison table
Model name | Consistency | Contextual Understanding | Correctness of Information | Detail Orientation | Temporal Understanding | mean |
---|---|---|---|---|---|---|
mvbench-a-comprehensive-multi-modal-video | 2.84 | 3.72 | 3.40 | 2.91 | 2.65 | 3.10 |
one-for-all-video-conversation-is-feasible | 2.2 | 2.89 | 2.16 | 2.46 | 2.13 | 2.46 |
ts-llava-constructing-visual-tokens-through | - | - | - | - | - | 3.38 |
llama-vid-an-image-is-worth-2-tokens-in-large | 2.51 | 3.53 | 2.96 | 3.00 | 2.46 | 2.89 |
llama-vid-an-image-is-worth-2-tokens-in-large | 2.63 | 3.60 | 3.07 | 3.05 | 2.58 | 2.99 |
llama-adapter-v2-parameter-efficient-visual | 2.15 | 2.30 | 2.03 | 2.32 | 1.98 | 2.16 |
one-for-all-video-conversation-is-feasible | 2.46 | 3.27 | 2.68 | 2.69 | 2.34 | 2.69 |
tuning-large-multimodal-models-for-videos | 3.32 | 4 | 3.63 | 3.25 | 3.23 | 3.49 |
mvbench-a-comprehensive-multi-modal-video | 2.81 | 3.51 | 3.02 | 2.88 | 2.66 | 2.98 |
cat-enhancing-multimodal-large-language-model | 2.89 | 3.49 | 3.08 | 2.95 | 2.81 | 3.07 |
ppllava-varied-video-sequence-understanding | 3.20 | 3.88 | 3.32 | 3.20 | 3.0 | 3.32 |
pllava-parameter-free-llava-extension-from-1 | 3.25 | 3.90 | 3.60 | 3.20 | 2.67 | 3.32 |
videogpt-integrating-image-and-video-encoders | 3.39 | 3.74 | 3.27 | 3.18 | 2.83 | 3.28 |
chat-univi-unified-visual-representation | 2.81 | 3.46 | 2.89 | 2.91 | 2.39 | 2.99 |
lita-language-instructed-temporal | 3.19 | 3.43 | 2.94 | 2.98 | 2.68 | 3.04 |
ppllava-varied-video-sequence-understanding | 3.81 | 4.21 | 3.85 | 3.56 | 3.21 | 3.73 |
videochat-chat-centric-video-understanding | 2.24 | 2.53 | 2.23 | 2.50 | 1.94 | 2.29 |
slowfast-llava-a-strong-training-free | - | - | - | - | - | 3.32 |
st-llm-large-language-models-are-effective-1 | 2.81 | 3.74 | 3.23 | 3.05 | 2.93 | 3.15 |
video-llama-an-instruction-tuned-audio-visual | 1.79 | 2.16 | 1.96 | 2.18 | 1.82 | 1.98 |
vtimellm-empower-llm-to-grasp-video-moments | 2.47 | 3.40 | 2.78 | 3.10 | 2.49 | 2.85 |
video-chatgpt-towards-detailed-video | 2.37 | 2.62 | 2.4 | 2.52 | 1.98 | 2.38 |
an-image-grid-can-be-worth-a-video-zero-shot | 3.13 | 3.61 | 3.40 | 2.80 | 2.89 | 3.17 |
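
For readers who want to work with the table programmatically, the following is a minimal sketch, not part of the benchmark itself. It assumes that the mean column is the unweighted average of the five per-dimension scores, which matches most but not all rows as listed, and that "-" marks a score that was not reported. The helper names (parse_row, recomputed_mean, rank_by_mean) are hypothetical and chosen only for illustration; the sample rows are copied from the table above.

```python
from typing import List, Optional

# A few rows copied verbatim from the comparison table above.
ROWS = [
    "mvbench-a-comprehensive-multi-modal-video | 2.84 | 3.72 | 3.40 | 2.91 | 2.65 | 3.10 |",
    "one-for-all-video-conversation-is-feasible | 2.2 | 2.89 | 2.16 | 2.46 | 2.13 | 2.46 |",
    "ts-llava-constructing-visual-tokens-through | - | - | - | - | - | 3.38 |",
    "ppllava-varied-video-sequence-understanding | 3.81 | 4.21 | 3.85 | 3.56 | 3.21 | 3.73 |",
]


def parse_row(line: str) -> dict:
    """Split one pipe-delimited row into a model name, five scores, and the mean."""
    cells = [cell.strip() for cell in line.strip().strip("|").split("|")]
    name, *values = cells
    # Assumption: "-" means the score was not reported for this entry.
    numbers: List[Optional[float]] = [None if v == "-" else float(v) for v in values]
    return {"model": name, "scores": numbers[:5], "mean": numbers[5]}


def recomputed_mean(scores: List[Optional[float]]) -> Optional[float]:
    """Unweighted average of the five scores, or None if any score is missing."""
    if any(score is None for score in scores):
        return None
    return round(sum(scores) / len(scores), 2)


def rank_by_mean(rows: List[dict]) -> List[dict]:
    """Sort parsed rows by the reported mean, highest first."""
    return sorted(rows, key=lambda row: row["mean"], reverse=True)


if __name__ == "__main__":
    parsed = [parse_row(line) for line in ROWS]
    for row in rank_by_mean(parsed):
        check = recomputed_mean(row["scores"])
        # Flag rows whose reported mean differs from the recomputed average.
        differs = check is not None and abs(check - row["mean"]) > 1e-9
        print(row["model"], row["mean"], check, "MISMATCH" if differs else "")
```

Rows where only the mean is reported cannot be checked this way, so the sketch returns None for them rather than guessing. A flagged mismatch only indicates that the listed values disagree under the averaging assumption; it does not say which number is the correct one.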