HyperAI

Zero-Shot Video Question Answering on MSRVTT-QA

Metrics

Accuracy
Confidence Score

Results

Performance results of various models on this benchmark

Comparison table
Model name                                      Accuracy  Confidence Score
chat-univi-unified-visual-representation        55.0      3.1
ts-llava-constructing-visual-tokens-through     66.2      3.6
one-for-all-video-conversation-is-feasible      51.2      2.9
video-llava-learning-united-visual-1            59.2      3.5
llama-vid-an-image-is-worth-2-tokens-in-large   57.7      3.2
an-image-grid-can-be-worth-a-video-zero-shot    63.8      3.5
omnidatacomposer-a-unified-data-structure-for   55.3      3.3
elysium-exploring-object-level-perception-in    67.5      3.2
moviechat-from-dense-token-to-sparse-memory     52.7      2.6
shot2story20k-a-new-benchmark-for               56.8      -
cat-enhancing-multimodal-large-language-model   62.1      3.5
one-for-all-video-conversation-is-feasible      51.2      2.9
mvbench-a-comprehensive-multi-modal-video       54.1      3.3
vista-llama-reliable-video-narrator-via-equal   60.5      3.3
tarsier-recipes-for-training-and-evaluating-1   66.4      3.7
video-lavit-unified-video-language-pre          59.3      3.3
videochat-chat-centric-video-understanding      45.0      2.5
videogpt-integrating-image-and-video-encoders   60.6      3.6
video-chatgpt-towards-detailed-video            49.3      2.8
pllava-parameter-free-llava-extension-from-1    68.7      3.6
slowfast-llava-a-strong-training-free           67.4      3.7
linvt-empower-your-image-level-large-language   66.2      4.0
flash-vstream-memory-based-real-time            72.4      3.4
ppllava-varied-video-sequence-understanding     64.3      3.5
llava-mini-efficient-image-and-video-large      59.5      3.6
st-llm-large-language-models-are-effective-1    63.2      3.4
video-llama-an-instruction-tuned-audio-visual   29.6      1.8
llama-vid-an-image-is-worth-2-tokens-in-large   58.9      3.3
minigpt4-video-advancing-multimodal-llms-for    59.73     -
llama-adapter-v2-parameter-efficient-visual     43.8      2.7
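
To compare entries, the rows can be ranked on either metric. The following is a minimal Python sketch, not part of the benchmark itself: the `Entry` type and the `rank_by` helper are hypothetical names introduced here for illustration, and `ROWS` holds only a handful of rows copied from the table above, with a missing Confidence Score ("-") stored as `None`.

```python
from typing import List, NamedTuple, Optional


class Entry(NamedTuple):
    """One leaderboard row (hypothetical type, for illustration only)."""
    model: str
    accuracy: float
    confidence: Optional[float]  # None where the table shows "-"


# A few rows copied from the table above, not the full leaderboard.
ROWS = [
    Entry("elysium-exploring-object-level-perception-in", 67.5, 3.2),
    Entry("shot2story20k-a-new-benchmark-for", 56.8, None),
    Entry("flash-vstream-memory-based-real-time", 72.4, 3.4),
    Entry("video-llama-an-instruction-tuned-audio-visual", 29.6, 1.8),
    Entry("linvt-empower-your-image-level-large-language", 66.2, 4.0),
]


def rank_by(rows: List[Entry], key: str) -> List[Entry]:
    """Sort rows from highest to lowest on `key`; missing values go last."""
    def sort_key(entry: Entry):
        value = getattr(entry, key)
        # Tuples compare element-wise: False sorts before True, so rows
        # with a value come first, ordered by descending value.
        return (value is None, -(value or 0.0))
    return sorted(rows, key=sort_key)


if __name__ == "__main__":
    for entry in rank_by(ROWS, "accuracy"):
        score = "-" if entry.confidence is None else f"{entry.confidence:.1f}"
        print(f"{entry.model:<48}{entry.accuracy:<10}{score}")
```

Calling `rank_by(ROWS, "confidence")` instead orders the same rows by Confidence Score. The two rankings need not agree: in this sample the entry with the highest accuracy does not have the highest Confidence Score.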