Visual Question Answering (VQA) on CORE-MM
Metrics
- Abductive
- Analogical
- Deductive
- Overall score
- Params
Results
Performance of various models on this benchmark, reported per reasoning category (abductive, analogical, deductive) along with the overall score and model parameter count.
Comparison Table
| Model Name | Abductive | Analogical | Deductive | Overall score | Params |
|---|---|---|---|---|---|
| gpt-4-technical-report-1 | 77.88 | 69.86 | 74.86 | 74.44 | - |
| sphinx-the-joint-mixing-of-weights-tasks-and | 49.85 | 20.69 | 42.17 | 39.48 | 16B |
| qwen-vl-a-frontier-large-vision-language | 44.39 | 30.42 | 37.55 | 37.39 | 16B |
| cogvlm-visual-expert-for-pretrained-language | 47.88 | 28.75 | 36.75 | 37.16 | 17B |
| improved-baselines-with-visual-instruction | 47.91 | 24.31 | 30.94 | 32.62 | 13B |
| llama-adapter-v2-parameter-efficient-visual | 46.12 | 22.08 | 28.70 | 30.46 | 7B |
| generative-pretraining-in-multimodality | 36.57 | 18.19 | 28.90 | 28.24 | 14B |
| instructblip-towards-general-purpose-vision | 37.76 | 20.56 | 27.56 | 28.02 | 8B |
| internlm-xcomposer-a-vision-language-large | 35.97 | 18.61 | 26.77 | 26.84 | 9B |
| otter-a-multi-modal-model-with-in-context | 33.64 | 13.33 | 22.49 | 22.69 | 7B |
| mplug-owl2-revolutionizing-multi-modal-large | 20.60 | 7.64 | 23.43 | 20.05 | 7B |
| blip-2-bootstrapping-language-image-pre | 18.96 | 7.50 | 2.76 | 19.31 | 3B |
| minigpt-4-enhancing-vision-language | 13.28 | 5.69 | 11.02 | 10.43 | 8B |
| openflamingo-an-open-source-framework-for | 5.30 | 1.11 | 8.88 | 6.82 | 9B |

Rows are sorted by Overall score, descending.