Zero-Shot Video Question Answering on MSVD-QA
Metrics
Accuracy
Confidence Score
Results
Performance results of various models on this benchmark
Comparison table
Model name | Accuracy (%) | Confidence Score |
---|---|---|
one-for-all-video-conversation-is-feasible | 67.0 | 3.6 |
videogpt-integrating-image-and-video-encoders | 72.4 | 3.6 |
slowfast-llava-a-strong-training-free | 79.9 | 4.1 |
video-chatgpt-towards-detailed-video | 64.9 | 3.3 |
ppllava-varied-video-sequence-understanding | 77.1 | 4.0 |
elysium-exploring-object-level-perception-in | 75.8 | 3.7 |
videochat-chat-centric-video-understanding | 56.3 | 2.8 |
flash-vstream-memory-based-real-time | 80.3 | 3.9 |
vila-on-pre-training-for-visual-language | 80.1 | - |
llava-mini-efficient-image-and-video-large | 70.9 | 4.0 |
ts-llava-constructing-visual-tokens-through | 79.4 | 4.1 |
an-image-grid-can-be-worth-a-video-zero-shot | 79.6 | 4.1 |
llama-vid-an-image-is-worth-2-tokens-in-large | 69.7 | 3.7 |
mvbench-a-comprehensive-multi-modal-video | 70.0 | 3.9 |
video-llama-an-instruction-tuned-audio-visual | 51.6 | 2.5 |
linvt-empower-your-image-level-large-language | 80.2 | 4.4 |
pllava-parameter-free-llava-extension-from-1 | 79.9 | 4.2 |
llama-adapter-v2-parameter-efficient-visual | 54.9 | 3.1 |
zero-shot-video-question-answering-via-frozen | 33.8 | - |
st-llm-large-language-models-are-effective-1 | 74.6 | 3.9 |
video-llava-learning-united-visual-1 | 70.7 | 3.9 |
tarsier-recipes-for-training-and-evaluating-1 | 80.3 | 4.2 |
video-lavit-unified-video-language-pre | 73.2 | 3.9 |
moviechat-from-dense-token-to-sparse-memory | 75.2 | 2.9 |
llama-vid-an-image-is-worth-2-tokens-in-large | 70.0 | 3.7 |
minigpt4-video-advancing-multimodal-llms-for | 73.92 | - |
chat-univi-unified-visual-representation | 69.3 | 3.7 |
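
The table is not sorted, so ranking the entries takes a small amount of parsing. The following is a minimal sketch, assuming the rows are available as pipe-delimited text exactly as shown above and that "-" marks a missing confidence score. The names `TABLE`, `parse_rows`, and `rank_by_accuracy` are hypothetical helpers introduced for illustration, and only a handful of rows are copied in to keep the example short.

```python
from typing import List, Optional, Tuple

Row = Tuple[str, float, Optional[float]]

# A few rows copied verbatim from the table above (not the full leaderboard).
TABLE = """\
slowfast-llava-a-strong-training-free | 79.9 | 4.1 |
flash-vstream-memory-based-real-time | 80.3 | 3.9 |
vila-on-pre-training-for-visual-language | 80.1 | - |
linvt-empower-your-image-level-large-language | 80.2 | 4.4 |
video-llama-an-instruction-tuned-audio-visual | 51.6 | 2.5 |
"""


def parse_rows(text: str) -> List[Row]:
    """Split each 'name | accuracy | score |' line into typed fields."""
    rows: List[Row] = []
    for line in text.strip().splitlines():
        cells = [cell.strip() for cell in line.split("|")]
        name, accuracy, score = cells[0], cells[1], cells[2]
        # Assumption: "-" means the confidence score was not reported.
        rows.append((name, float(accuracy), None if score == "-" else float(score)))
    return rows


def rank_by_accuracy(rows: List[Row]) -> List[Row]:
    """Return rows ordered from highest to lowest accuracy."""
    return sorted(rows, key=lambda row: row[1], reverse=True)


ranked = rank_by_accuracy(parse_rows(TABLE))
for name, accuracy, score in ranked:
    shown_score = "n/a" if score is None else f"{score:.1f}"
    print(f"{accuracy:5.1f}  {shown_score:>3}  {name}")
```

Keeping missing scores as `None` rather than 0 avoids treating an unreported confidence score as a poor one if the rows are later sorted or averaged on that column.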