HyperAI

Video Question Answering on NExT-QA

Metrics

Accuracy

Results

Performance results of various models on this benchmark

Comparison table
Model name                                     Accuracy
video-instruction-tuning-with-synthetic-data   83.2
llava-next-interleave-tackling-multi-image     79.1
atm-action-temporality-modeling-for-video      58.3
mvbench-a-comprehensive-multi-modal-video      79.5
vipergpt-visual-inference-via-python           60.0
longvila-scaling-long-context-visual-language  80.7
video-graph-transformer-for-video-question     56.9
text-conditioned-resampler-for-long-form       73.5
vlap-efficient-video-language-alignment-via    75.6
hitea-hierarchical-temporal-aware-video        63.1
video-as-conditional-graph-hierarchy-for       51.4
rtq-rethinking-video-language-understanding    63.2
glance-and-focus-memory-prompting-for-multi-1  58.83
lstp-language-guided-spatial-temporal-prompt   72.1
large-language-models-are-temporal-and-causal  75.5
contrastive-video-question-answering-via       60.7
semi-parametric-video-grounded-text            60.6
mvbench-a-comprehensive-multi-modal-video      78.6
vamos-versatile-action-models-for-video        77.3
linvt-empower-your-image-level-large-language  85.5
mirasol3b-a-multimodal-autoregressive-model    72
qwen2-vl-enhancing-vision-language-model-s     81.2
llava-next-interleave-tackling-multi-image     78.2
mist-multi-modal-iterative-spatial-temporal    57.2
videollama-2-advancing-spatial-temporal        75.6
paxion-patching-action-knowledge-in-video-1    56.9
expanding-performance-boundaries-of-open       85.5
llava-next-interleave-tackling-multi-image     77.9
nvila-efficient-frontier-visual-language       82.2
verbs-in-action-improving-verb-understanding   58.6
video-graph-transformer-for-video-question     55.0
self-chained-image-language-model-for-video-1  73.8
contrastive-video-question-answering-via       60.0
llava-onevision-easy-visual-task-transfer      79.4
revisiting-the-video-in-video-language         54.3
vlap-efficient-video-language-alignment-via    74.4
mvbench-a-comprehensive-multi-modal-video      68.6
mplug-owl3-towards-long-image-sequence         78.6
oryx-mllm-on-demand-spatial-temporal           81.8
llava-onevision-easy-visual-task-transfer      80.2
2-5-1-d-spatio-temporal-scene-graphs-for       53.4
crema-multimodal-compositional-video           73.9
videollama-3-frontier-multimodal-foundation    84.5
bimba-selective-scan-compression-for-long      83.73