Video Question Answering on NExT-QA
Metrics
Accuracy
Results
Performance results of various models on this benchmark
Comparison table
Model name | Accuracy (%) |
---|---|
video-instruction-tuning-with-synthetic-data | 83.2 |
llava-next-interleave-tackling-multi-image | 79.1 |
atm-action-temporality-modeling-for-video | 58.3 |
mvbench-a-comprehensive-multi-modal-video | 79.5 |
vipergpt-visual-inference-via-python | 60.0 |
longvila-scaling-long-context-visual-language | 80.7 |
video-graph-transformer-for-video-question | 56.9 |
text-conditioned-resampler-for-long-form | 73.5 |
vlap-efficient-video-language-alignment-via | 75.6 |
hitea-hierarchical-temporal-aware-video | 63.1 |
video-as-conditional-graph-hierarchy-for | 51.4 |
rtq-rethinking-video-language-understanding | 63.2 |
glance-and-focus-memory-prompting-for-multi-1 | 58.83 |
lstp-language-guided-spatial-temporal-prompt | 72.1 |
large-language-models-are-temporal-and-causal | 75.5 |
contrastive-video-question-answering-via | 60.7 |
semi-parametric-video-grounded-text | 60.6 |
mvbench-a-comprehensive-multi-modal-video | 78.6 |
vamos-versatile-action-models-for-video | 77.3 |
linvt-empower-your-image-level-large-language | 85.5 |
mirasol3b-a-multimodal-autoregressive-model | 72 |
qwen2-vl-enhancing-vision-language-model-s | 81.2 |
llava-next-interleave-tackling-multi-image | 78.2 |
mist-multi-modal-iterative-spatial-temporal | 57.2 |
videollama-2-advancing-spatial-temporal | 75.6 |
paxion-patching-action-knowledge-in-video-1 | 56.9 |
expanding-performance-boundaries-of-open | 85.5 |
llava-next-interleave-tackling-multi-image | 77.9 |
nvila-efficient-frontier-visual-language | 82.2 |
verbs-in-action-improving-verb-understanding | 58.6 |
video-graph-transformer-for-video-question | 55.0 |
self-chained-image-language-model-for-video-1 | 73.8 |
contrastive-video-question-answering-via | 60.0 |
llava-onevision-easy-visual-task-transfer | 79.4 |
revisiting-the-video-in-video-language | 54.3 |
vlap-efficient-video-language-alignment-via | 74.4 |
mvbench-a-comprehensive-multi-modal-video | 68.6 |
mplug-owl3-towards-long-image-sequence | 78.6 |
oryx-mllm-on-demand-spatial-temporal | 81.8 |
llava-onevision-easy-visual-task-transfer | 80.2 |
2-5-1-d-spatio-temporal-scene-graphs-for | 53.4 |
crema-multimodal-compositional-video | 73.9 |
videollama-3-frontier-multimodal-foundation | 84.5 |
bimba-selective-scan-compression-for-long | 83.73 |