HyperAI

Visual Question Answering on VQA v2 test-dev

Metrics

Accuracy
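For context, accuracy on VQA v2 is conventionally computed with the consensus-based VQA metric: a predicted answer scores min(number of annotators who gave that answer / 3, 1), and per-question scores are averaged over the benchmark. The sketch below assumes this standard formulation and ten human answers per question; the official evaluator additionally normalizes answers (lowercasing, article and punctuation stripping) and averages over leave-one-out annotator subsets, which is omitted here.

```python
# Minimal sketch of the standard VQA accuracy metric (assumed here to
# follow the common VQA v2 formulation: min(#matching annotators / 3, 1)).

def vqa_accuracy(predicted: str, human_answers: list[str]) -> float:
    """Score one prediction against the (typically 10) human answers."""
    matches = sum(1 for a in human_answers if a == predicted)
    return min(matches / 3.0, 1.0)

def benchmark_accuracy(predictions: list[str],
                       annotations: list[list[str]]) -> float:
    """Average per-question scores, reported as a percentage."""
    scores = [vqa_accuracy(p, anns)
              for p, anns in zip(predictions, annotations)]
    return 100.0 * sum(scores) / len(scores)
```

Under this metric an answer given by three or more annotators scores full credit, which is why leaderboard accuracies are not simple exact-match rates.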

Results

Performance results of various models on this benchmark

Comparison Table
Model Name | Accuracy
one-peace-exploring-one-general | 82.6
towards-vqa-models-that-can-read | 69.21
mplug-effective-and-efficient-vision-language | 82.43
multi-grained-vision-language-pre-training | 78.22
image-as-a-foreign-language-beit-pretraining | 84.19
prismer-a-vision-language-model-with-an | 78.43
coarse-to-fine-reasoning-for-visual-question | 72.5
mutan-multimodal-tucker-fusion-for-visual | 67.42
flamingo-a-visual-language-model-for-few-shot-1 | 56.3
tips-and-tricks-for-visual-question-answering | 69.87
achieving-human-parity-on-visual-question | 81.26
align-before-fuse-vision-and-language | 75.84
oscar-object-semantics-aligned-pre-training | 73.82
simvlm-simple-visual-language-model | 80.03
blip-2-bootstrapping-language-image-pre | 52.3
implicit-differentiable-outlier-detection | 77.9
vilt-vision-and-language-transformer-without | 71.26
visual-commonsense-r-cnn | 71.21
blip-2-bootstrapping-language-image-pre | 62.3
blip-2-bootstrapping-language-image-pre | 49.7
deep-modular-co-attention-networks-for-visual-1 | 70.63
lyrics-boosting-fine-grained-language-vision | 81.2
internvl-scaling-up-vision-foundation-models | 81.2
blip-2-bootstrapping-language-image-pre | 52.6
multimodal-compact-bilinear-pooling-for | 64.7
learning-to-count-objects-in-natural-images | 68.09
Model 27 | 80.23
pali-a-jointly-scaled-multilingual-language | 84.3
flamingo-a-visual-language-model-for-few-shot-1 | 49.2
vlmo-unified-vision-language-pre-training | 82.78
sparse-and-continuous-attention-mechanisms | 65.96
Model 32 | 51.0
enabling-multimodal-generation-on-clip-via | 44.5
lako-knowledge-driven-visual-question | 68.07
learning-to-reason-end-to-end-module-networks | 64.9
blip-2-bootstrapping-language-image-pre | 65
in-defense-of-grid-features-for-visual | 72.59
visualbert-a-simple-and-performant-baseline | 70.8
vl-bert-pre-training-of-generic-visual | 71.16
lxmert-learning-cross-modality-encoder | 69.9
block-bilinear-superdiagonal-fusion-for | 67.58
x-2-vlm-all-in-one-pre-trained-model-for | 80.4
vilbert-pretraining-task-agnostic | 70.55
blip-2-bootstrapping-language-image-pre | 63
cumo-scaling-multimodal-llm-with-co-upcycled | 82.2
murel-multimodal-relational-reasoning-for | 68.03
compact-trilinear-interaction-for-visual | 67.4
plug-and-play-vqa-zero-shot-vqa-by-conjoining | 64.8
bilinear-attention-networks | 70.04
rubi-reducing-unimodal-biases-in-visual | 63.18
toward-building-general-foundation-models-for | 80.4
vl-bert-pre-training-of-generic-visual | 71.79
valor-vision-audio-language-omni-perception | 78.46
uniter-learning-universal-image-text-1 | 73.24
flamingo-a-visual-language-model-for-few-shot-1 | 51.8
x-2-vlm-all-in-one-pre-trained-model-for | 81.9