Visual Question Answering on VQA v2 test-dev
Evaluation Metric
Accuracy
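Accuracy on VQA v2 is the standard soft VQA accuracy from the benchmark's evaluation protocol: each question has ten human answers, and a prediction counts as fully correct when at least three annotators gave that answer, with partial credit below that. Below is a minimal Python sketch of this formula; the helper name `vqa_accuracy` is illustrative, and the official evaluator additionally normalizes answer strings (lowercasing, stripping articles and punctuation) and averages over leave-one-annotator-out subsets.

```python
from collections import Counter

def vqa_accuracy(predicted: str, human_answers: list[str]) -> float:
    """Soft VQA accuracy: min(#annotators who gave the answer / 3, 1).

    Simplified sketch; the official evaluator also normalizes
    answer strings and averages over annotator subsets.
    """
    matches = Counter(human_answers)[predicted]
    return min(matches / 3.0, 1.0)

# Ten human answers for one question (hypothetical example).
answers = ["2", "2", "two", "2", "2", "3", "two", "3", "2 dogs", "4"]
print(vqa_accuracy("2", answers))    # 1.0   (5 matches >= 3)
print(vqa_accuracy("two", answers))  # 0.67  (2 matches, partial credit)
```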
Evaluation Results
Performance results for each model on this benchmark.
Comparison Table
Model | Accuracy |
---|---|
one-peace-exploring-one-general | 82.6 |
towards-vqa-models-that-can-read | 69.21 |
mplug-effective-and-efficient-vision-language | 82.43 |
multi-grained-vision-language-pre-training | 78.22 |
image-as-a-foreign-language-beit-pretraining | 84.19 |
prismer-a-vision-language-model-with-an | 78.43 |
coarse-to-fine-reasoning-for-visual-question | 72.5 |
mutan-multimodal-tucker-fusion-for-visual | 67.42 |
flamingo-a-visual-language-model-for-few-shot-1 | 56.3 |
tips-and-tricks-for-visual-question-answering | 69.87 |
achieving-human-parity-on-visual-question | 81.26 |
align-before-fuse-vision-and-language | 75.84 |
oscar-object-semantics-aligned-pre-training | 73.82 |
simvlm-simple-visual-language-model | 80.03 |
blip-2-bootstrapping-language-image-pre | 52.3 |
implicit-differentiable-outlier-detection | 77.9 |
vilt-vision-and-language-transformer-without | 71.26 |
visual-commonsense-r-cnn | 71.21 |
blip-2-bootstrapping-language-image-pre | 62.3 |
blip-2-bootstrapping-language-image-pre | 49.7 |
deep-modular-co-attention-networks-for-visual-1 | 70.63 |
lyrics-boosting-fine-grained-language-vision | 81.2 |
internvl-scaling-up-vision-foundation-models | 81.2 |
blip-2-bootstrapping-language-image-pre | 52.6 |
multimodal-compact-bilinear-pooling-for | 64.7 |
learning-to-count-objects-in-natural-images | 68.09 |
Model 27 | 80.23 |
pali-a-jointly-scaled-multilingual-language | 84.3 |
flamingo-a-visual-language-model-for-few-shot-1 | 49.2 |
vlmo-unified-vision-language-pre-training | 82.78 |
sparse-and-continuous-attention-mechanisms | 65.96 |
Model 32 | 51.0 |
enabling-multimodal-generation-on-clip-via | 44.5 |
lako-knowledge-driven-visual-question | 68.07 |
learning-to-reason-end-to-end-module-networks | 64.9 |
blip-2-bootstrapping-language-image-pre | 65 |
in-defense-of-grid-features-for-visual | 72.59 |
visualbert-a-simple-and-performant-baseline | 70.8 |
vl-bert-pre-training-of-generic-visual | 71.16 |
lxmert-learning-cross-modality-encoder | 69.9 |
block-bilinear-superdiagonal-fusion-for | 67.58 |
x-2-vlm-all-in-one-pre-trained-model-for | 80.4 |
vilbert-pretraining-task-agnostic | 70.55 |
blip-2-bootstrapping-language-image-pre | 63 |
cumo-scaling-multimodal-llm-with-co-upcycled | 82.2 |
murel-multimodal-relational-reasoning-for | 68.03 |
compact-trilinear-interaction-for-visual | 67.4 |
plug-and-play-vqa-zero-shot-vqa-by-conjoining | 64.8 |
bilinear-attention-networks | 70.04 |
rubi-reducing-unimodal-biases-in-visual | 63.18 |
toward-building-general-foundation-models-for | 80.4 |
vl-bert-pre-training-of-generic-visual | 71.79 |
valor-vision-audio-language-omni-perception | 78.46 |
uniter-learning-universal-image-text-1 | 73.24 |
flamingo-a-visual-language-model-for-few-shot-1 | 51.8 |
x-2-vlm-all-in-one-pre-trained-model-for | 81.9 |
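For working with these numbers programmatically, the table can be ranked by accuracy; a minimal sketch, with a few entries transcribed from the comparison table above into an illustrative `results` dict:

```python
# A subset of the comparison table, keyed by model slug.
results = {
    "pali-a-jointly-scaled-multilingual-language": 84.3,
    "image-as-a-foreign-language-beit-pretraining": 84.19,
    "vlmo-unified-vision-language-pre-training": 82.78,
    "one-peace-exploring-one-general": 82.6,
    "mplug-effective-and-efficient-vision-language": 82.43,
}

# Print models in descending order of test-dev accuracy.
for rank, (model, acc) in enumerate(
    sorted(results.items(), key=lambda kv: kv[1], reverse=True), start=1
):
    print(f"{rank}. {model}: {acc:.2f}")
```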