Cross-Modal Retrieval on COCO 2014
Metrics
Image-to-text R@1
Image-to-text R@5
Image-to-text R@10
Text-to-image R@1
Text-to-image R@5
Text-to-image R@10
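Recall@K (R@K) is the percentage of queries whose correct match appears among the top-K retrieved candidates. A minimal sketch of this computation, assuming a precomputed query-candidate similarity matrix and one ground-truth candidate per query (the actual COCO protocol pairs each image with five captions, so image-to-text evaluation counts a hit if any of the five appears in the top-K):

```python
import numpy as np

def recall_at_k(sim, k, gt):
    """Percentage of queries whose ground-truth index is in the top-k by similarity."""
    # sim: (num_queries, num_candidates) similarity matrix
    # gt:  ground-truth candidate index for each query
    topk = np.argsort(-sim, axis=1)[:, :k]          # indices of the k most similar candidates
    hits = (topk == np.asarray(gt)[:, None]).any(axis=1)
    return hits.mean() * 100.0

# Toy example: 3 queries, 4 candidates.
sim = np.array([[0.9, 0.1, 0.2, 0.0],
                [0.2, 0.3, 0.8, 0.1],
                [0.5, 0.6, 0.1, 0.4]])
gt = [0, 2, 0]  # correct candidate index per query

print(recall_at_k(sim, 1, gt))  # the third query's match is ranked 2nd, so R@1 ≈ 66.7
print(recall_at_k(sim, 2, gt))  # all matches fall within the top-2, so R@2 = 100.0
```

In the table below, the same routine would be applied in both directions (image queries against caption candidates, and vice versa) with K = 1, 5, 10.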
Results
Performance results of various models on this benchmark
Comparison table
Model name | Image-to-text R@1 | Image-to-text R@5 | Image-to-text R@10 | Text-to-image R@1 | Text-to-image R@5 | Text-to-image R@10 |
---|---|---|---|---|---|---|
polysemous-visual-semantic-embedding-for-1 | 45.2 | 74.3 | 84.5 | 32.4 | 63.0 | 75.0 |
3shnet-boosting-image-sentence-retrieval-via | 67.9 | 90.5 | 95.4 | 50.3 | 79.3 | 87.7 |
toward-building-general-foundation-models-for | 84.2 | 96.4 | 98.4 | 67.0 | 87.2 | 92.4 |
an-empirical-study-of-training-end-to-end | 76.16 | 93.16 | 96.82 | 57.08 | 82.66 | 90.07 |
similarity-reasoning-and-filtration-for-image | 57.8 | 84.9 | 91.6 | 41.9 | 70.7 | 81.3 |
lile-look-in-depth-before-looking-elsewhere-a | 55.6 | 82.4 | 91.0 | 41.5 | 72.1 | 82.2 |
vilt-vision-and-language-transformer-without | 61.5 | 86.3 | 92.7 | 42.7 | 72.9 | 83.1 |
position-guided-text-prompt-for-vision | 81.5 | 95.9 | 97.9 | 64.9 | 87.4 | 92.2 |
deep-visual-semantic-alignments-for | 41.2 | 70.5 | 81.1 | 25.3 | 53.4 | 66.4 |
dissecting-deep-metric-learning-losses-for | 81.4 | 95.6 | 97.9 | 63.6 | 86.0 | 91.5 |
imram-iterative-matching-with-recurrent | 53.7 | 83.2 | 91.0 | 39.7 | 69.1 | 79.8 |
visual-semantic-reasoning-for-image-text | 53.0 | 81.1 | 89.4 | 40.5 | 70.6 | 81.1 |
florence-a-new-foundation-model-for-computer | 81.8 | 95.2 | - | 63.2 | 85.7 | - |
dynamic-self-adaptive-multiscale-distillation | 48.0 | 75.6 | 84.5 | 62.1 | 85.9 | 92.0 |
align-before-fuse-vision-and-language | 77.6 | 94.3 | 97.2 | 60.7 | 84.3 | 90.5 |
x-2-vlm-all-in-one-pre-trained-model-for | 83.5 | 96.3 | 98.5 | 66.2 | 87.1 | 92.2 |
omnivl-one-foundation-model-for-image | 82.1 | 95.9 | 98.1 | 64.8 | 86.1 | 91.6 |
multi-grained-vision-language-pre-training | 81.2 | 95.6 | 98.2 | 63.4 | 85.8 | 91.5 |
x-2-vlm-all-in-one-pre-trained-model-for | 84.4 | 96.5 | 98.5 | 67.7 | 87.5 | 92.5 |
ernie-vil-2-0-multi-view-contrastive-learning | 77.4 | 93.6 | 97.1 | 59.5 | 83.4 | 90.1 |
oscar-object-semantics-aligned-pre-training | 73.5 | 92.2 | 96.0 | 57.5 | 82.8 | 89.8 |
visualsparta-sparse-transformer-fragment | - | - | - | 44.4 | 72.8 | 82.4 |
vast-a-vision-audio-subtitle-text-omni-1 | - | - | - | 68.0 | 87.7 | 92.8 |
plug-and-play-regulators-for-image-text | 61.3 | 86.1 | 92.6 | 44.3 | 73.2 | 83.2 |
implicit-differentiable-outlier-detection | 80.7 | 95.1 | 96.8 | 62.9 | 84.8 | 92.8 |
vision-language-pre-training-with-triple | 75.6 | 92.8 | 96.7 | 59.0 | 83.2 | 89.9 |
napreg-nouns-as-proxies-regularization-for | 59.8 | - | - | 43.0 | - | - |
Model 28 | 80.7 | 95.3 | 97.8 | 62.8 | 84.8 | 91.0 |
vista-vision-and-scene-text-aggregation-for | 68.9 | 90.1 | 95.4 | 52.6 | 79.6 | 87.6 |
aladin-distilling-fine-grained-alignment | 64.9 | 88.6 | 94.5 | 51.3 | 79.2 | 87.5 |
stacked-cross-attention-for-image-text | 50.4 | 82.2 | 90.0 | 38.6 | 69.3 | 80.4 |
learning-semantic-concepts-and-order-for | 42.8 | 72.3 | 83.0 | 33.1 | 62.9 | 75.5 |
scaling-up-visual-and-vision-language | 77.0 | 93.5 | 96.9 | 59.9 | 83.3 | 89.8 |
image-as-a-foreign-language-beit-pretraining | 84.8 | 96.5 | 98.3 | 67.2 | 87.7 | 92.8 |
mammut-a-simple-architecture-for-joint | 70.7 | 89.1 | 93.7 | - | - | - |
valor-vision-audio-language-omni-perception | - | - | - | 61.4 | 84.4 | 90.9 |