Visual Question Answering on MM-Vet
Evaluation metric
GPT-4 score
Evaluation results
Performance of each model on this benchmark.
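MM-Vet reports a GPT-4 score: GPT-4 grades each open-ended model answer on a 0-1 scale, and the benchmark score is the average over all samples scaled to 0-100. Because GPT-4 grading is non-deterministic, grading is typically repeated and reported as mean ± standard deviation, which is where the "±" entries in the table below come from. The following is a minimal sketch of that aggregation under those assumptions; the per-sample grades shown are purely illustrative toy values, not real benchmark data.

```python
import statistics

def mmvet_score(per_sample_grades: list[float]) -> float:
    """Aggregate per-sample GPT-4 grades (each in [0, 1]) into a 0-100 benchmark score."""
    return 100.0 * sum(per_sample_grades) / len(per_sample_grades)

# Hypothetical example: three independent grading runs over the same model outputs.
runs = [
    [1.0, 0.5, 0.0, 0.8],  # run 1: grades for four toy samples
    [1.0, 0.4, 0.0, 0.8],  # run 2
    [1.0, 0.5, 0.1, 0.8],  # run 3
]
scores = [mmvet_score(r) for r in runs]
# Report as mean ± std across grading runs, matching the table's "±" entries.
print(f"{statistics.mean(scores):.1f} ± {statistics.stdev(scores):.1f}")
```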
Comparison table
Model name | GPT-4 score |
---|---|
list-items-one-by-one-a-new-data-source-and | 37.2 |
provision-programmatically-scaling-vision | 40.4 |
mmctagent-multi-modal-critical-thinking-agent | 74.24 |
deepseek-vl-towards-real-world-vision | 41.5 |
volcano-mitigating-multimodal-hallucination | 38.0 |
lova3-learning-to-visual-question-answering | 35.2 |
gamified-crowd-sourcing-of-high-quality-data | 52.43 |
mplug-owl2-revolutionizing-multi-modal-large | 36.3±0.1 |
llava-plus-learning-to-use-tools-for-creating | 27.5±0.3 |
mixture-of-subspaces-in-low-rank-adaptation | 35.2 |
janusflow-harmonizing-autoregression-and | 30.9 |
convllava-hierarchical-backbones-as-visual | 45.9 |
mini-gemini-mining-the-potential-of-multi | 53.0 |
infimm-hd-a-leap-forward-in-high-resolution | 38.9 |
a-stitch-in-time-saves-nine-small-vlm-is-a | 52.10 |
what-if-we-recaption-billions-of-web-images | 37.8 |
Model 17 | 64.4 |
feast-your-eyes-mixture-of-resolution | 35.5 |
collavo-crayon-large-language-and-vision | 40.3 |
silkie-preference-distillation-for-large | 49.9 |
janus-pro-unified-multimodal-understanding | 39.8 |
expanding-performance-boundaries-of-open | 72.3 |
calibrated-self-rewarding-vision-language | 33.9 |
mm-instruct-generated-visual-instructions-for | 37.1 |
densefusion-1m-merging-vision-experts-for | 37.8 |
inf-llava-dual-perspective-perception-for | 34.5 |
llavolta-efficient-multi-modal-models-via | 30.7 |
mm1-5-methods-analysis-insights-from | 52.0 |
minigpt-4-enhancing-vision-language | 24.4±0.4 |
gpt-4-technical-report-1 | 67.6±0.1 |
enhancing-visual-language-modality-alignment | 31.6 |
Model 32 | 78.1±0.2 |
generative-multimodal-models-are-in-context | 48.5 |
cogvlm-visual-expert-for-pretrained-language | 63.9 |
mousi-poly-visual-expert-vision-language | 38.4 |
llava-ph-efficient-multi-modal-assistant-with | 28.9 |
mimic-it-multi-modal-in-context-instruction | 24.7±0.3 |
deciphering-cross-modal-alignment-in-large | 32.2 |
Model 39 | 61.8 |
expanding-performance-boundaries-of-open | 68.8 |
mixture-of-subspaces-in-low-rank-adaptation | 35.2 |
cross-modal-safety-mechanism-transfer-in | 25.6 |
strengthening-multimodal-large-language-model | 41.4 |
a-stitch-in-time-saves-nine-small-vlm-is-a | 63.20 |
openflamingo-an-open-source-framework-for | 21.8±0.1 |
tokenpacker-efficient-visual-projector-for | 29.6 |
camml-context-aware-multimodal-learner-for | 36.4 |
mminstruct-a-high-quality-multi-modal | 34.4 |
dynamic-llava-efficient-multimodal-large | 37.3 |
vila-on-pre-training-for-visual-language | 45.7 |
uni-moe-scaling-unified-multimodal-llms-with | 32.8 |
beyond-embeddings-the-promise-of-visual-table | 39.8 |
expanding-performance-boundaries-of-open | 65.0 |
mm1-methods-analysis-insights-from-multimodal | 42.1 |
mini-gemini-mining-the-potential-of-multi | 60.8 |
visionzip-longer-is-better-but-not-necessary | 32.9 |
video-llava-learning-united-visual-1 | 32.0 |
sphinx-x-scaling-data-and-parameters-for-a | 47.9 |
visionzip-longer-is-better-but-not-necessary | 30.2 |
emu3-next-token-prediction-is-all-you-need | 37.2 |
mg-llava-towards-multi-granularity-visual | 48.5 |
taco-learning-multi-modal-action-models-with | 45.2 |
llava-plus-learning-to-use-tools-for-creating | 35.0±0.0 |
aligngpt-multi-modal-large-language-models | 35.6 |
meteor-mamba-based-traversal-of-rationale-for | 57.3 |
lyra-an-efficient-and-speech-centric | 63.5 |
mammoth-vl-eliciting-multimodal-reasoning | 60.6 |
volcano-mitigating-multimodal-hallucination | 32.0 |
mm1-5-methods-analysis-insights-from | 41.0 |
mmdu-a-multi-turn-multi-image-dialog | 38.8 |
Model 71 | 81.2±0.4 |
chain-of-spot-interactive-reasoning-improves | 37.6 |
aligned-vector-quantization-for-edge-cloud | 30.7 |
textit-v-guided-visual-search-as-a-core | 27.7 |
merlin-empowering-multimodal-llms-with | 34.9 |
gamified-crowd-sourcing-of-high-quality-data | 51.789 |
aligning-large-multi-modal-model-with-robust | 31.7±0.1 |
Model 78 | 54.7 |
mm1-5-methods-analysis-insights-from | 39.8 |
mm1-5-methods-analysis-insights-from | 42.2 |
vlfeedback-a-large-scale-ai-feedback-dataset | 44.2 |
points-improving-your-vision-language-model | 50.0 |
an-empirical-study-of-scaling-instruct-tuned | 36.4 |
multi-modal-auto-regressive-modeling-via | 44.0 |
visionzip-longer-is-better-but-not-necessary | 32.6 |
mmfuser-multimodal-multi-layer-feature-fuser | 36.6 |
img-diff-contrastive-data-synthesis-for | 44.1 |
deepstack-deeply-stacking-visual-tokens-is | 39.3 |
looking-beyond-text-reducing-language-bias-in | 39.90 |
visionzip-longer-is-better-but-not-necessary | 32.6 |
onellm-one-framework-to-align-all-modalities | 29.1 |
internlm-xcomposer2-mastering-free-form-text | 51.2 |
minigpt-4-enhancing-vision-language | 22.1±0.1 |
xmodel-vlm-a-simple-baseline-for-multimodal | 21.8 |
sq-llava-self-questioning-for-large-vision | 35.5 |
focusllava-a-coarse-to-fine-approach-for | 41.3 |
mminstruct-a-high-quality-multi-modal | 37.9 |
a-stitch-in-time-saves-nine-small-vlm-is-a | 65.60 |
h2ovl-mississippi-vision-language-models | 44.7 |
towards-semantic-equivalence-of-tokenization | 48.7 |
gpt-4-technical-report-1 | 69.3±0.1 |
image-of-thought-prompting-for-visual | 72.2 |
mm1-5-methods-analysis-insights-from | 37.4 |
gpt-4-technical-report-1 | 68.6±0.1 |
Model 105 | 45.3 |
mplug-owl3-towards-long-image-sequence | 40.1 |
tokenpacker-efficient-visual-projector-for | 34.1 |
dreamllm-synergistic-multimodal-comprehension | 35.9 |
visionzip-longer-is-better-but-not-necessary | 31.7 |
strengthening-multimodal-large-language-model | 36.8 |
llava-onevision-easy-visual-task-transfer | 57.5 |
dynamic-mixture-of-experts-an-auto-tuning | 33.6 |
internlm-xcomposer2-4khd-a-pioneering-large | 54.9 |
illume-illuminating-your-llms-to-see-draw-and | 37.0 |
aligngpt-multi-modal-large-language-models | 30.8 |
hallucination-augmented-contrastive-learning | 30.4 |
maven-an-effective-multi-granularity-hybrid | 30.4 |
vlfeedback-a-large-scale-ai-feedback-dataset | 50.7 |
omnifusion-technical-report | 39.40 |
claude-3-5-sonnet-model-card-addendum | 74.2±0.2 |
lyra-an-efficient-and-speech-centric | 71.4 |
video-lavit-unified-video-language-pre | 33.2 |
cumo-scaling-multimodal-llm-with-co-upcycled | 51.0 |
robocodex-multimodal-code-generation-for | 31.0 |
dragonfly-multi-resolution-zoom-supercharges | 35.9 |
qwen-vl-a-frontier-large-vision-language | 66.6±0.5 |
mm1-methods-analysis-insights-from-multimodal | 43.7 |
teamlora-boosting-low-rank-adaptation-with | 31.2 |
vlfeedback-a-large-scale-ai-feedback-dataset | 49.9 |
improved-baselines-with-visual-instruction | 31.1±0.2 |
mm-react-prompting-chatgpt-for-multimodal | 27.9±0.1 |
baichuan-omni-technical-report | 65.4 |
calibrated-self-rewarding-vision-language | 37.8 |
janus-pro-unified-multimodal-understanding | 50.0 |
mmar-towards-lossless-multi-modal-auto | 18.49 |
cogagent-a-visual-language-model-for-gui | 52.8 |
gemini-a-family-of-highly-capable-multimodal-1 | 64.3±0.4 |
qwen2-vl-enhancing-vision-language-model-s | 49.5 |
stablellava-enhanced-visual-instruction | 36.1 |
imp-highly-capable-large-multimodal-models | 44.6 |
cogvlm-visual-expert-for-pretrained-language | 52.8 |
flashsloth-lightning-multimodal-large | 41.9 |
list-items-one-by-one-a-new-data-source-and | 35.9 |
sharegpt4v-improving-large-multi-modal-models | 43.1 |
llava-onevision-easy-visual-task-transfer | 63.7 |
mini-gemini-mining-the-potential-of-multi | 59.3 |
Model 147 | 57.4 |
rethinking-visual-prompting-for-multimodal | 35.1 |
deciphering-cross-modal-alignment-in-large | 42.9 |
qwen2-vl-enhancing-vision-language-model-s | 74.0 |
ferret-v2-an-improved-baseline-for-referring | 35.7 |
moai-mixture-of-all-intelligence-for-large | 43.7 |
explore-the-limits-of-omni-modal-pretraining | 31.4 |
openflamingo-an-open-source-framework-for | 24.8±0.2 |
gpt-4-technical-report-1 | 67.7±0.3 |
tinyllava-a-framework-of-small-scale-large | 32.0 |
expanding-performance-boundaries-of-open | 60.8 |
how-far-are-we-to-gpt-4v-closing-the-gap-to | 48.9 |
densefusion-1m-merging-vision-experts-for | 37.5 |
otterhd-a-high-resolution-multi-modality | 26.3 |
cogvlm2-visual-language-models-for-image-and | 71.1 |
internlm-xcomposer-2-5-a-versatile-large | 51.7 |
taco-learning-multi-modal-action-models-with | 45.7 |
gpt-4-technical-report-1 | 60.2±0.3 |
improved-baselines-with-visual-instruction | 36.3±0.2 |
h2ovl-mississippi-vision-language-models | 30.0 |
enhancing-multimodal-large-language-models | 38.9 |
vlfeedback-a-large-scale-ai-feedback-dataset | 44.1 |
lyra-an-efficient-and-speech-centric | 51.2 |
sea-supervised-embedding-alignment-for-token | 48.8 |
generative-pretraining-in-multimodality | 36.3±0.3 |
how-far-are-we-to-gpt-4v-closing-the-gap-to | 62.8 |
mammoth-vl-eliciting-multimodal-reasoning | 62.3 |
infmllm-a-unified-framework-for-visual | 33.4 |
flashsloth-lightning-multimodal-large | 49.0 |
vila-2-vila-augmented-vila | 50.0 |
blip-2-bootstrapping-language-image-pre | 22.4±0.2 |
looking-beyond-text-reducing-language-bias-in | 35.20 |
expanding-performance-boundaries-of-open | 48.8 |
janus-decoupling-visual-encoding-for-unified | 34.3 |
enhancing-large-vision-language-models-with | 45.0 |
mm1-methods-analysis-insights-from-multimodal | 48.7 |
qwen-vl-a-frontier-large-vision-language | 61.1±0.2 |
vary-scaling-up-the-vision-vocabulary-for | 36.2 |
imp-highly-capable-large-multimodal-models | 43.3 |
enhancing-large-vision-language-models-with | 32.6 |
a-comprehensive-overhaul-of-multimodal | 32.1 |
mimic-it-multi-modal-in-context-instruction | 24.6±0.2 |
vl-mamba-exploring-state-space-models-for | 32.6 |
self-supervised-visual-preference-alignment | 41.0 |
sphinx-the-joint-mixing-of-weights-tasks-and | 40.2 |
dynamic-llava-efficient-multimodal-large | 32.2 |
llava-onevision-easy-visual-task-transfer | 29.1 |
mmar-towards-lossless-multi-modal-auto | 27.80 |
cogvlm2-visual-language-models-for-image-and | 58.0 |
self-supervised-visual-preference-alignment | 37.2 |
gemini-1-5-unlocking-multimodal-understanding | 65.8±0.1 |
visual-agents-as-fast-and-slow-thinkers | 31.0 |
g-mod-exploring-mixture-of-depth-adaptation | 34.0 |
moe-llava-mixture-of-experts-for-large-vision | 35.9 |
imp-highly-capable-large-multimodal-models | 33.5 |
qwen2-vl-enhancing-vision-language-model-s | 62.0 |
improving-multi-modal-large-language-model | 34.8 |
expanding-performance-boundaries-of-open | 60.6 |
coco-is-all-you-need-for-visual-instruction | 37.5 |
provision-programmatically-scaling-vision | 38.5 |
visionzip-longer-is-better-but-not-necessary | 31.7 |
mm1-5-methods-analysis-insights-from | 43.7 |
robomamba-multimodal-state-space-model-for | 29.7 |
the-all-seeing-project-v2-towards-general | 41.3 |
crome-cross-modal-adapters-for-efficient | 55.1 |
expanding-performance-boundaries-of-open | 62.8 |
small-language-model-meets-with-reinforced | 29.0 |
sq-llava-self-questioning-for-large-vision | 39.7 |
llama-adapter-v2-parameter-efficient-visual | 31.4±0.1 |
mm-instruct-generated-visual-instructions-for | 32.9 |
phantom-of-latent-for-large-language-and | 70.8 |
Model 218 | 58.1±0.1 |
mm-react-prompting-chatgpt-for-multimodal | 44.6±0.2 |
gamified-crowd-sourcing-of-high-quality-data | 64.954 |
sharegpt4v-improving-large-multi-modal-models | 37.6 |
beyond-embeddings-the-promise-of-visual-table | 31.8 |
hyperllava-dynamic-visual-and-language-expert | 31.0 |
trol-traversal-of-layers-for-large-language | 54.7 |
gemini-1-5-unlocking-multimodal-understanding | 76.9±0.1 |
linvt-empower-your-image-level-large-language | 23.5 |
to-see-is-to-believe-prompting-gpt-4v-for | 40.2 |
taco-learning-multi-modal-action-models-with | 50.9 |
textbind-multi-turn-interleaved-multimodal | 19.4 |