Image Captioning on COCO Captions
Metrics
BLEU-1
BLEU-4
CIDEr
METEOR
ROUGE-L
SPICE
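All of the metrics above score a generated caption by its n-gram overlap with human reference captions; BLEU-n is the simplest of them. The following is a toy single-reference sketch of BLEU, assuming whitespace tokenization — not the multi-reference pycocoevalcap implementation behind the official COCO numbers, which are also reported on a 0–100 scale:

```python
# Toy BLEU-n sketch: geometric mean of modified n-gram precisions
# times a brevity penalty. Single reference, no smoothing.
import math
from collections import Counter

def ngrams(tokens, n):
    """Multiset of n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(candidate, reference, max_n=4):
    cand, ref = candidate.split(), reference.split()
    log_prec = 0.0
    for n in range(1, max_n + 1):
        c, r = ngrams(cand, n), ngrams(ref, n)
        # clipped overlap: each candidate n-gram counts at most as often
        # as it appears in the reference
        overlap = sum(min(cnt, r[g]) for g, cnt in c.items())
        total = max(sum(c.values()), 1)
        if overlap == 0:
            return 0.0  # unsmoothed BLEU is zero if any precision is zero
        log_prec += math.log(overlap / total) / max_n
    # brevity penalty punishes candidates shorter than the reference
    bp = 1.0 if len(cand) > len(ref) else math.exp(1 - len(ref) / max(len(cand), 1))
    return bp * math.exp(log_prec)

cand = "a man is riding a horse on the beach"
ref = "a man is riding a horse on a beach"
print(round(bleu(cand, ref, 1), 3))  # BLEU-1 ≈ 0.889
print(round(bleu(cand, ref, 4), 3))  # BLEU-4 ≈ 0.751
```

Multiplied by 100, these values live on the same scale as the table below; the real COCO protocol averages clipped counts over five references per image, which is why toolkit scores differ from this sketch.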
Results
Performance results of various models on this benchmark
Comparison table
Model name | BLEU-1 | BLEU-4 | CIDEr | METEOR | ROUGE-L | SPICE |
---|---|---|---|---|---|---|
m2-meshed-memory-transformer-for-image | 80.8 | 39.1 | 131.2 | 29.2 | 58.6 | 22.6 |
mplug-effective-and-efficient-vision-language | - | 46.5 | 155.1 | 32.0 | - | 26.0 |
ladic-are-diffusion-models-really-inferior-to | - | - | - | - | 58.7 | 22.4 |
beyond-a-pre-trained-object-detector-cross | - | 41.3 | 142.2 | - | - | 24.9 |
reflective-decoding-network-for-image | 80.2 | 37.3 | 125.2 | 28.1 | 57.4 | - |
simvlm-simple-visual-language-model | - | 40.6 | 143.3 | 33.4 | - | 25.4 |
refinecap-concept-aware-refinement-for-image | 80.2 | 37.8 | 127.2 | 28.3 | 58.0 | 22.5 |
x-linear-attention-networks-for-image | 80.9 | 39.7 | 132.8 | 29.5 | 59.1 | 23.4 |
coca-contrastive-captioners-are-image-text | - | 40.9 | 143.6 | 33.9 | - | 24.7 |
position-guided-text-prompt-for-vision | - | 40.1 | 135.0 | 30.4 | - | 23.7 |
fine-grained-image-captioning-with-clip | - | 38.2 | 124.9 | 28.7 | 58.5 | - |
visual-commonsense-r-cnn | - | 39.5 | - | 29.3 | 59.3 | - |
multi-grained-vision-language-pre-training | - | 41.3 | 140.8 | - | - | - |
from-captions-to-visual-concepts-and-back | - | 25.7 | - | 23.6 | - | - |
scaling-up-vision-language-pre-training-for | - | 42.6 | 145.5 | 31.4 | - | 25.5 |
Model 16 | - | - | 84.7 | - | - | 16.8 |
clipcap-clip-prefix-for-image-captioning | - | 33.53 | 113.08 | 27.45 | - | 21.05 |
clipcap-clip-prefix-for-image-captioning | - | 32.15 | 108.35 | 27.1 | - | 20.12 |
blip-2-bootstrapping-language-image-pre | - | 42.4 | 144.5 | - | - | - |
grit-faster-and-better-image-captioning | 84.2 | 42.4 | 144.2 | 30.6 | 60.7 | 24.3 |
a-better-variant-of-self-critical-sequence | 80.7 | 39.4 | 129.6 | 28.9 | 58.7 | 22.8 |
expansionnet-v2-block-static-expansion-in | 83.5 | 42.7 | 143.7 | 30.6 | 61.1 | 24.7 |
l-verse-bidirectional-generation-between | - | 39.9 | - | 31.4 | 60.4 | 23.3 |
unifying-architectures-tasks-and-modalities | - | 44.9 | 154.9 | 32.5 | - | 26.6 |
vinvl-making-visual-representations-matter-in | - | 41.0 | 140.9 | 31.1 | - | 25.2 |
oscar-object-semantics-aligned-pre-training | - | 41.7 | 140.0 | 30.6 | - | 24.5 |
vast-a-vision-audio-subtitle-text-omni-1 | - | - | 149.0 | - | - | 27.0 |
text-only-training-for-image-captioning-using | - | 26.4 | 91.8 | 25.1 | - | - |
git-a-generative-image-to-text-transformer | - | 44.1 | 151.1 | 32.2 | - | 26.3 |
blip-2-bootstrapping-language-image-pre | - | 43.5 | 145.2 | - | - | - |
valor-vision-audio-language-omni-perception | - | - | 152.5 | - | - | 25.7 |
blip-2-bootstrapping-language-image-pre | - | 43.7 | 145.8 | - | - | - |
prompt-tuning-for-generative-multimodal | - | 41.81 | 141.4 | 31.51 | - | 24.42 |
prismer-a-vision-language-model-with-an | - | 40.4 | 136.5 | 31.4 | - | 24.4 |
enabling-multimodal-generation-on-clip-via | - | 16.7 | 58.3 | 19.7 | - | 13.4 |
fusecap-leveraging-large-language-models-to | - | - | - | - | - | - |
beyond-a-pre-trained-object-detector-cross | 83.4 | 41.4 | 139.9 | 30.4 | 60.4 | 24.0 |
virtex-learning-visual-representations-from | - | - | 94 | - | - | 18.5 |
ladic-are-diffusion-models-really-inferior-to | - | 38.2 | 126.2 | 29.5 | - | - |
beyond-a-pre-trained-object-detector-cross | 81.5 | 39.7 | 135.9 | 30.0 | 59.5 | 23.7 |