Zero-Shot Video Question Answering on NExT-QA
Metrics
Accuracy
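Accuracy here is the percentage of questions for which the model's predicted option matches the ground-truth answer in the zero-shot multiple-choice setting. A minimal sketch of such a computation, assuming a hypothetical results file whose records carry "prediction" and "answer" fields (not the official evaluation script):

```python
import json

def accuracy(results_path: str) -> float:
    """Fraction (in %) of questions whose predicted option matches the ground truth."""
    with open(results_path) as f:
        # Assumed format: a list of {"prediction": "...", "answer": "..."} records.
        records = json.load(f)
    correct = sum(1 for r in records if r["prediction"] == r["answer"])
    return 100.0 * correct / len(records)

if __name__ == "__main__":
    # "nextqa_zero_shot_results.json" is a placeholder filename for illustration.
    print(f"Accuracy: {accuracy('nextqa_zero_shot_results.json'):.1f}")
```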
Results
Performance results of various models on this benchmark
Comparison Table
Model Name | Accuracy (%) |
---|---|
mvbench-a-comprehensive-multi-modal-video | 61.7 |
vidctx-context-aware-video-question-answering | 70.7 |
long-context-transfer-from-language-to-vision | 67.1 |
an-image-grid-can-be-worth-a-video-zero-shot | 70.9 |
understanding-long-videos-in-one-multimodal | 55.2 |
question-instructed-visual-descriptions-for | 66.3 |
videotree-adaptive-tree-based-video | 73.5 |
self-chained-image-language-model-for-video-1 | 63.6 |
traveler-a-multi-lmm-agent-framework-for | 68.2 |
zero-shot-video-question-answering-with | 64.6 |
tarsier-recipes-for-training-and-evaluating-1 | 79.2 |
vipergpt-visual-inference-via-python | 60.0 |
a-simple-llm-framework-for-long-range-video | 67.7 |
morevqa-exploring-modular-reasoning-models | 69.2 |
an-image-grid-can-be-worth-a-video-zero-shot | 68.6 |
deepstack-deeply-stacking-visual-tokens-is | 61.0 |
videoagent-long-form-video-understanding-with | 71.3 |
too-many-frames-not-all-useful-efficient | 72.9 |
a-simple-llm-framework-for-long-range-video | 54.3 |
ts-llava-constructing-visual-tokens-through | 73.6 |
enter-event-based-interpretable-reasoning-for | 75.1 |
mistral-7b | 51.1 |
slowfast-llava-a-strong-training-free | 64.2 |
language-repository-for-long-video | 60.9 |
verbs-in-action-improving-verb-understanding | 51.5 |