HyperAI

Visual Question Answering on MSVD-QA

Metrics

Accuracy
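
Accuracy is the only metric reported here. As a rough illustration, and not the benchmark's official evaluation script, accuracy for open-ended video QA is commonly taken to be the fraction of questions whose predicted answer matches the reference answer. The sketch below assumes exact string matching after lowercasing and trimming whitespace; the function name `accuracy` and that normalization step are assumptions made for illustration only.

```python
def accuracy(predictions, references):
    """Fraction of questions whose predicted answer matches the reference.

    Assumption: answers are compared as strings after lowercasing and
    stripping surrounding whitespace. A real evaluation script may
    normalize answers differently.
    """
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        raise ValueError("cannot compute accuracy on an empty set")

    # Count the positions where the normalized strings agree.
    correct = sum(
        p.strip().lower() == r.strip().lower()
        for p, r in zip(predictions, references)
    )
    return correct / len(references)


if __name__ == "__main__":
    # Toy data, invented for this example (not taken from MSVD-QA).
    preds = ["cat", "Dog", "run", "woman"]
    refs = ["cat", "dog", "walk", "man"]
    print(f"accuracy = {accuracy(preds, refs):.3f}")
```

On this reading, a score of 0.547 in the table would mean that about 54.7% of the test questions were answered correctly.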

Results

Performance results of various models on this benchmark

Comparison table
Model name                                      Accuracy
multi-efficient-video-and-language              0.547
omnivl-one-foundation-model-for-image           0.510
heterogeneous-memory-enhanced-multimodal        0.337
dualvgr-a-dual-visual-graph-reasoning-unit      0.390
vast-a-vision-audio-subtitle-text-omni-1        0.60
hierarchical-conditional-relation-networks      0.361
clover-towards-a-unified-video-language         0.524
all-in-one-exploring-unified-video-language     0.483
sas-video-qa-self-adaptive-sampling-for         0.467
internvideo-general-video-foundation-models     0.555
video-text-modeling-with-zero-shot-transfer     0.569
open-vocabulary-video-question-answering-a      0.438
x-2-vlm-all-in-one-pre-trained-model-for        0.528
tgif-qa-toward-spatio-temporal-reasoning-in     0.313
mammut-a-simple-architecture-for-joint          0.602
unmasked-teacher-towards-training-efficient     0.552
lightweight-recurrent-cross-modal-encoder-for   0.478
motion-appearance-co-memory-networks-for        0.317
x-2-vlm-all-in-one-pre-trained-model-for        0.546
meltr-meta-loss-transformer-for-learning-to     0.517
hitea-hierarchical-temporal-aware-video         0.556
open-vocabulary-video-question-answering-a      0.558
cosa-concatenated-sample-pretrained-vision      0.60
mplug-2-a-modularized-multi-modal-foundation    0.581
vid-tldr-training-free-token-merging-for        0.549
sas-video-qa-self-adaptive-sampling-for         0.469
open-vocabulary-video-question-answering-a      0.495
ma-lmm-memory-augmented-large-multimodal        0.606
valor-vision-audio-language-omni-perception     0.60
align-and-prompt-video-and-language-pre         0.459
noise-estimation-using-density-estimation-for   0.351
video-question-answering-with-iterative-video   0.486
an-empirical-study-of-end-to-end-video          0.547
vlab-enhancing-video-language-pre-training-by   0.61
git-a-generative-image-to-text-transformer      0.568
open-vocabulary-video-question-answering-a      0.477