HyperAI초신경

Image Captioning on COCO Captions

Evaluation Metrics

BLEU-1
BLEU-4
CIDEr
METEOR
ROUGE-L
SPICE

Evaluation Results

Performance results of each model on this benchmark. A dash (-) marks a metric that was not reported.

Comparison Table
| Model Name | BLEU-1 | BLEU-4 | CIDEr | METEOR | ROUGE-L | SPICE |
|---|---|---|---|---|---|---|
| m2-meshed-memory-transformer-for-image | 80.8 | 39.1 | 131.2 | 29.2 | 58.6 | 22.6 |
| mplug-effective-and-efficient-vision-language | - | 46.5 | 155.1 | 32.0 | - | 26.0 |
| ladic-are-diffusion-models-really-inferior-to | - | - | - | - | 58.7 | 22.4 |
| beyond-a-pre-trained-object-detector-cross | - | 41.3 | 142.2 | - | - | 24.9 |
| reflective-decoding-network-for-image | 80.2 | 37.3 | 125.2 | 28.1 | 57.4 | - |
| simvlm-simple-visual-language-model | - | 40.6 | 143.3 | 33.4 | - | 25.4 |
| refinecap-concept-aware-refinement-for-image | 80.2 | 37.8 | 127.2 | 28.3 | 58.0 | 22.5 |
| x-linear-attention-networks-for-image | 80.9 | 39.7 | 132.8 | 29.5 | 59.1 | 23.4 |
| coca-contrastive-captioners-are-image-text | - | 40.9 | 143.6 | 33.9 | - | 24.7 |
| position-guided-text-prompt-for-vision | - | 40.1 | 135.0 | 30.4 | - | 23.7 |
| fine-grained-image-captioning-with-clip | - | 38.2 | 124.9 | 28.7 | 58.5 | - |
| visual-commonsense-r-cnn | - | 39.5 | - | 29.3 | 59.3 | - |
| multi-grained-vision-language-pre-training | - | 41.3 | 140.8 | - | - | - |
| from-captions-to-visual-concepts-and-back | - | 25.7 | - | 23.6 | - | - |
| scaling-up-vision-language-pre-training-for | - | 42.6 | 145.5 | 31.4 | - | 25.5 |
| Model 16 | - | - | 84.7 | - | - | 16.8 |
| clipcap-clip-prefix-for-image-captioning | - | 33.53 | 113.08 | 27.45 | - | 21.05 |
| clipcap-clip-prefix-for-image-captioning | - | 32.15 | 108.35 | 27.1 | - | 20.12 |
| blip-2-bootstrapping-language-image-pre | - | 42.4 | 144.5 | - | - | - |
| grit-faster-and-better-image-captioning | 84.2 | 42.4 | 144.2 | 30.6 | 60.7 | 24.3 |
| a-better-variant-of-self-critical-sequence | 80.7 | 39.4 | 129.6 | 28.9 | 58.7 | 22.8 |
| expansionnet-v2-block-static-expansion-in | 83.5 | 42.7 | 143.7 | 30.6 | 61.1 | 24.7 |
| l-verse-bidirectional-generation-between | - | 39.9 | - | 31.4 | 60.4 | 23.3 |
| unifying-architectures-tasks-and-modalities | - | 44.9 | 154.9 | 32.5 | - | 26.6 |
| vinvl-making-visual-representations-matter-in | - | 41.0 | 140.9 | 31.1 | - | 25.2 |
| oscar-object-semantics-aligned-pre-training | - | 41.7 | 140 | 30.6 | - | 24.5 |
| vast-a-vision-audio-subtitle-text-omni-1 | - | - | 149.0 | - | - | 27.0 |
| text-only-training-for-image-captioning-using | - | 26.4 | 91.8 | 25.1 | - | - |
| git-a-generative-image-to-text-transformer | - | 44.1 | 151.1 | 32.2 | - | 26.3 |
| blip-2-bootstrapping-language-image-pre | - | 43.5 | 145.2 | - | - | - |
| valor-vision-audio-language-omni-perception | - | - | 152.5 | - | - | 25.7 |
| blip-2-bootstrapping-language-image-pre | - | 43.7 | 145.8 | - | - | - |
| prompt-tuning-for-generative-multimodal | - | 41.81 | 141.4 | 31.51 | - | 24.42 |
| prismer-a-vision-language-model-with-an | - | 40.4 | 136.5 | 31.4 | - | 24.4 |
| enabling-multimodal-generation-on-clip-via | - | 16.7 | 58.3 | 19.7 | - | 13.4 |
| fusecap-leveraging-large-language-models-to | - | - | - | - | - | - |
| beyond-a-pre-trained-object-detector-cross | 83.4 | 41.4 | 139.9 | 30.4 | 60.4 | 24.0 |
| virtex-learning-visual-representations-from | - | - | 94 | - | - | 18.5 |
| ladic-are-diffusion-models-really-inferior-to | - | 0.382 | 126.2 | 29.5 | - | - |
| beyond-a-pre-trained-object-detector-cross | 81.5 | 39.7 | 135.9 | 30.0 | 59.5 | 23.7 |