Zero-Shot Video Question Answering on ActivityNet
Metrics
Accuracy
Confidence Score
Results
Performance results of various models on this benchmark
Comparison table
Model name | Accuracy | Confidence Score |
---|---|---|
moviechat-from-dense-token-to-sparse-memory | 45.7 | 3.1 |
one-for-all-video-conversation-is-feasible | 46.1 | 3.2 |
tarsier-recipes-for-training-and-evaluating-1 | 61.6 | 3.7 |
mvbench-a-comprehensive-multi-modal-video | 49.1 | 3.3 |
chat-univi-unified-visual-representation | 46.1 | 3.3 |
llama-vid-an-image-is-worth-2-tokens-in-large | 47.5 | 3.3 |
pllava-parameter-free-llava-extension-from-1 | 60.9 | 3.7 |
an-image-grid-can-be-worth-a-video-zero-shot | 58.4 | 3.5 |
slowfast-llava-a-strong-training-free | 59.2 | 3.5 |
llava-mini-efficient-image-and-video-large | 53.5 | 3.5 |
zero-shot-video-question-answering-via-frozen | 24.7 | - |
videochat-chat-centric-video-understanding | 26.5 | 2.2 |
llama-vid-an-image-is-worth-2-tokens-in-large | 47.4 | 3.3 |
video-chatgpt-towards-detailed-video | 35.2 | 2.7 |
flash-vstream-memory-based-real-time | 51.9 | 3.4 |
video-llava-learning-united-visual-1 | 45.3 | 3.3 |
ts-llava-constructing-visual-tokens-through | 58.9 | 3.5 |
elysium-exploring-object-level-perception-in | 43.4 | 2.9 |
ppllava-varied-video-sequence-understanding | 60.7 | 3.6 |
linvt-empower-your-image-level-large-language | 60.1 | 3.6 |
chat-univi-unified-visual-representation | 46.4 | 3.6 |
video-llama-an-instruction-tuned-audio-visual | 12.4 | 1.1 |
video-lavit-unified-video-language-pre | 50.1 | 3.3 |
st-llm-large-language-models-are-effective-1 | 50.9 | 3.3 |
cat-enhancing-multimodal-large-language-model | 50.2 | 3.5 |
videogpt-integrating-image-and-video-encoders | 50.6 | 3.6 |
llama-adapter-v2-parameter-efficient-visual | 34.2 | 2.7 |
minigpt4-video-advancing-multimodal-llms-for | 46.3 | - |
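
The rows above are not sorted, so a small script can make the ranking easier to read. The following is a minimal sketch, assuming the table is available as pipe-delimited text lines in the layout shown above. The helper names `parse_rows` and `rank_by_accuracy` are hypothetical, and the three sample rows are copied from the table for illustration only. A confidence score of `-` is treated as missing.

```python
from typing import List, Optional, Tuple

# (model name, accuracy, confidence score or None when the table shows "-")
Row = Tuple[str, float, Optional[float]]


def parse_rows(lines: List[str]) -> List[Row]:
    """Parse pipe-delimited leaderboard lines, skipping header and separator rows."""
    rows: List[Row] = []
    for line in lines:
        cells = [c.strip() for c in line.strip().strip("|").split("|")]
        if len(cells) != 3:
            continue
        name, acc, score = cells
        try:
            accuracy = float(acc)
        except ValueError:
            continue  # header ("Accuracy") or separator ("---") row
        rows.append((name, accuracy, None if score == "-" else float(score)))
    return rows


def rank_by_accuracy(rows: List[Row]) -> List[Row]:
    """Return rows sorted by accuracy, highest first."""
    return sorted(rows, key=lambda r: r[1], reverse=True)


# Sample rows copied from the table above (assumed input format).
sample = [
    "Model name | Accuracy | Confidence Score |",
    "---|---|---|",
    "tarsier-recipes-for-training-and-evaluating-1 | 61.6 | 3.7 |",
    "zero-shot-video-question-answering-via-frozen | 24.7 | - |",
    "pllava-parameter-free-llava-extension-from-1 | 60.9 | 3.7 |",
]

ranked = rank_by_accuracy(parse_rows(sample))
for name, accuracy, score in ranked:
    shown = "n/a" if score is None else f"{score:.1f}"
    print(f"{accuracy:5.1f}  {shown:>4}  {name}")
```

Keeping a missing score as `None` rather than 0 avoids penalizing entries that simply do not report the metric when sorting or averaging on that column.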