Visual Question Answering on MSRVTT-QA
Metrics
Accuracy
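As a rough illustration of the metric, the sketch below computes accuracy as the fraction of predicted answers that exactly match the reference answer. It assumes case-insensitive, whitespace-trimmed string matching; the function name `accuracy` and the sample answers are hypothetical and are not taken from the benchmark or from any model's evaluation code.

```python
def accuracy(predictions, references):
    """Return the fraction of predictions that match their reference answer.

    Assumption: matching is exact after lowercasing and trimming whitespace.
    """
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        raise ValueError("cannot compute accuracy on an empty set")
    correct = sum(
        p.strip().lower() == r.strip().lower()
        for p, r in zip(predictions, references)
    )
    return correct / len(references)


# Hypothetical answers, for illustration only (not benchmark data).
preds = ["dog", "two", "Running", "car"]
refs = ["dog", "three", "running", "truck"]
score = accuracy(preds, refs)
print(f"{score:.3f}")
```

The scores in the table are on the same 0 to 1 scale, so 0.470 corresponds to 47.0% of questions answered correctly.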
Results
Performance results of various models on this benchmark
Comparison table
Model name | Accuracy |
---|---|
vid-tldr-training-free-token-merging-for | 0.470 |
unmasked-teacher-towards-training-efficient | 0.471 |
less-is-more-clipbert-for-video-and-language | 0.374 |
open-vocabulary-video-question-answering-a | 0.395 |
flamingo-a-visual-language-model-for-few-shot-1 | 0.310 |
open-vocabulary-video-question-answering-a | 0.470 |
all-in-one-exploring-unified-video-language | 0.443 |
video-text-as-game-players-hierarchical | 0.462 |
motion-appearance-co-memory-networks-for | 0.32 |
video-text-modeling-with-zero-shot-transfer | 0.463 |
tgif-qa-toward-spatio-temporal-reasoning-in | 0.309 |
clover-towards-a-unified-video-language | 0.441 |
align-and-prompt-video-and-language-pre | 0.421 |
video-question-answering-with-iterative-video | 0.457 |
vlab-enhancing-video-language-pre-training-by | 0.496 |
heterogeneous-memory-enhanced-multimodal | 0.33 |
x-2-vlm-all-in-one-pre-trained-model-for | 0.45 |
sas-video-qa-self-adaptive-sampling-for | 0.438 |
open-vocabulary-video-question-answering-a | 0.418 |
lightweight-recurrent-cross-modal-encoder-for | 0.42 |
dualvgr-a-dual-visual-graph-reasoning-unit | 0.355 |
sas-video-qa-self-adaptive-sampling-for | 0.440 |
omnivl-one-foundation-model-for-image | 0.441 |
flamingo-a-visual-language-model-for-few-shot-1 | 0.174 |
multi-efficient-video-and-language | 0.478 |
mammut-a-simple-architecture-for-joint | 0.495 |
x-2-vlm-all-in-one-pre-trained-model-for | 0.455 |
sas-video-qa-self-adaptive-sampling-for | 0.423 |
hierarchical-conditional-relation-networks | 0.356 |
expectation-maximization-contrastive-learning | 0.458 |
internvideo-general-video-foundation-models | 0.471 |
hitea-hierarchical-temporal-aware-video | 0.459 |
mplug-2-a-modularized-multi-modal-foundation | 0.480 |
flamingo-a-visual-language-model-for-few-shot-1 | 0.474 |