Zero-Shot Video Question Answering on MSRVTT-QA
Metrics
Accuracy
Confidence Score
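As a rough illustration of how two such metrics could be aggregated, the sketch below assumes each answer has already been judged with a correct/incorrect flag and a numeric score. The judging procedure is not described on this page, so the `Judgment` type and the `aggregate` helper are hypothetical names introduced only for this example.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Judgment:
    """One judged answer (hypothetical structure, not taken from the benchmark)."""
    correct: bool   # whether the answer was judged correct
    score: float    # numeric score assigned to the answer


def aggregate(judgments: List[Judgment]) -> Tuple[float, float]:
    """Return (accuracy in percent, mean score), each rounded to one decimal."""
    if not judgments:
        raise ValueError("need at least one judgment")
    n = len(judgments)
    # Accuracy: share of answers judged correct, expressed as a percentage.
    accuracy = 100.0 * sum(j.correct for j in judgments) / n
    # Confidence score: assumed here to be the plain mean of per-answer scores.
    mean_score = sum(j.score for j in judgments) / n
    return round(accuracy, 1), round(mean_score, 1)


if __name__ == "__main__":
    demo = [
        Judgment(True, 4.0),
        Judgment(False, 2.0),
        Judgment(True, 5.0),
        Judgment(True, 3.0),
    ]
    print(aggregate(demo))
```

Treating the confidence score as a simple average is an assumption; the benchmark's own evaluation scripts may differ.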
Results
Performance results of various models on this benchmark
Comparison table
Model name | Accuracy | Confidence Score |
---|---|---|
chat-univi-unified-visual-representation | 55.0 | 3.1 |
ts-llava-constructing-visual-tokens-through | 66.2 | 3.6 |
one-for-all-video-conversation-is-feasible | 51.2 | 2.9 |
video-llava-learning-united-visual-1 | 59.2 | 3.5 |
llama-vid-an-image-is-worth-2-tokens-in-large | 57.7 | 3.2 |
an-image-grid-can-be-worth-a-video-zero-shot | 63.8 | 3.5 |
omnidatacomposer-a-unified-data-structure-for | 55.3 | 3.3 |
elysium-exploring-object-level-perception-in | 67.5 | 3.2 |
moviechat-from-dense-token-to-sparse-memory | 52.7 | 2.6 |
shot2story20k-a-new-benchmark-for | 56.8 | - |
cat-enhancing-multimodal-large-language-model | 62.1 | 3.5 |
mvbench-a-comprehensive-multi-modal-video | 54.1 | 3.3 |
vista-llama-reliable-video-narrator-via-equal | 60.5 | 3.3 |
tarsier-recipes-for-training-and-evaluating-1 | 66.4 | 3.7 |
video-lavit-unified-video-language-pre | 59.3 | 3.3 |
videochat-chat-centric-video-understanding | 45.0 | 2.5 |
videogpt-integrating-image-and-video-encoders | 60.6 | 3.6 |
video-chatgpt-towards-detailed-video | 49.3 | 2.8 |
pllava-parameter-free-llava-extension-from-1 | 68.7 | 3.6 |
slowfast-llava-a-strong-training-free | 67.4 | 3.7 |
linvt-empower-your-image-level-large-language | 66.2 | 4.0 |
flash-vstream-memory-based-real-time | 72.4 | 3.4 |
ppllava-varied-video-sequence-understanding | 64.3 | 3.5 |
llava-mini-efficient-image-and-video-large | 59.5 | 3.6 |
st-llm-large-language-models-are-effective-1 | 63.2 | 3.4 |
video-llama-an-instruction-tuned-audio-visual | 29.6 | 1.8 |
llama-vid-an-image-is-worth-2-tokens-in-large | 58.9 | 3.3 |
minigpt4-video-advancing-multimodal-llms-for | 59.73 | - |
llama-adapter-v2-parameter-efficient-visual | 43.8 | 2.7 |
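For readers who want to work with the comparison table programmatically, the following is a minimal sketch that parses pipe-delimited rows like those above and ranks them by accuracy. The function names `parse_rows` and `top_by_accuracy` are hypothetical, and the sample contains only a few rows copied from the table; a missing confidence score (`-`) is mapped to `None`.

```python
from typing import List, Optional, Tuple

Row = Tuple[str, float, Optional[float]]

# A few rows copied from the table, used only for demonstration.
SAMPLE = """
Model name | Accuracy | Confidence Score |
---|---|---|
flash-vstream-memory-based-real-time | 72.4 | 3.4 |
shot2story20k-a-new-benchmark-for | 56.8 | - |
pllava-parameter-free-llava-extension-from-1 | 68.7 | 3.6 |
video-llama-an-instruction-tuned-audio-visual | 29.6 | 1.8 |
"""


def parse_rows(table_text: str) -> List[Row]:
    """Parse 'name | accuracy | confidence |' lines, skipping header and separator."""
    rows: List[Row] = []
    for line in table_text.strip().splitlines():
        cells = [c.strip() for c in line.strip().strip("|").split("|")]
        if len(cells) != 3:
            continue
        name, acc, conf = cells
        try:
            accuracy = float(acc)
        except ValueError:
            continue  # header or separator line, not a data row
        confidence = None if conf == "-" else float(conf)
        rows.append((name, accuracy, confidence))
    return rows


def top_by_accuracy(rows: List[Row], k: int = 3) -> List[Row]:
    """Return the k rows with the highest accuracy, best first."""
    return sorted(rows, key=lambda r: r[1], reverse=True)[:k]


if __name__ == "__main__":
    for name, accuracy, confidence in top_by_accuracy(parse_rows(SAMPLE), k=2):
        print(name, accuracy, confidence)
```

Sorting on accuracy alone ignores the confidence score; a combined ranking would need a weighting that this page does not define.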