Cross-Modal Retrieval on Flickr30K
Metrics
Image-to-text R@1
Image-to-text R@10
Image-to-text R@5
Text-to-image R@1
Text-to-image R@10
Text-to-image R@5
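
For reference, Recall@K (R@K) is conventionally the percentage of queries for which at least one correct match appears among the top K retrieved items. The sketch below is a minimal Python illustration under that assumption. The function name `recall_at_k`, the nested-list score layout, and the toy scores are hypothetical and are not taken from any listed model's evaluation code.

```python
def recall_at_k(scores, relevant, k):
    """Percentage of queries with at least one relevant candidate in the top k.

    scores:   one list of similarity scores per query (higher = more similar).
    relevant: one set of correct candidate indices per query.
    """
    hits = 0
    for query_scores, rel in zip(scores, relevant):
        # Rank candidate indices by descending similarity.
        ranked = sorted(range(len(query_scores)),
                        key=lambda j: query_scores[j],
                        reverse=True)
        # A query counts as a hit if any correct candidate is in the top k.
        if any(j in rel for j in ranked[:k]):
            hits += 1
    return 100.0 * hits / len(scores)


# Toy data (made up): 3 image queries scored against 4 candidate captions.
scores = [
    [0.9, 0.1, 0.3, 0.2],  # correct caption 0 is ranked 1st
    [0.2, 0.4, 0.8, 0.5],  # correct caption 1 is ranked 3rd
    [0.1, 0.2, 0.3, 0.7],  # correct caption 3 is ranked 1st
]
relevant = [{0}, {1}, {3}]

r_at_1 = recall_at_k(scores, relevant, 1)  # 2 of 3 queries hit
r_at_3 = recall_at_k(scores, relevant, 3)  # all 3 queries hit
```

For image-to-text retrieval the queries are images and the candidates are captions, so each `relevant` set may hold several caption indices. For text-to-image retrieval the roles are swapped.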
Results
Performance of each model on this benchmark
Comparison table
Model | Image-to-text R@1 | Image-to-text R@10 | Image-to-text R@5 | Text-to-image R@1 | Text-to-image R@10 | Text-to-image R@5 |
---|---|---|---|---|---|---|
vse-improving-visual-semantic-embeddings-with | 52.9 | 87.2 | 80.5 | 39.6 | 79.5 | 70.1 |
dual-path-convolutional-image-text-embedding | - | 89.5 | - | 39.1 | 80.9 | 69.2 |
dissecting-deep-metric-learning-losses-for | 97.0 | 100 | 99.6 | 86.3 | 99.0 | 97.4 |
similarity-reasoning-and-filtration-for-image | 77.8 | 97.4 | 94.1 | 58.5 | 88.8 | 83.0 |
napreg-nouns-as-proxies-regularization-for | 79.6 | - | - | 60.0 | - | - |
image-as-a-foreign-language-beit-pretraining | 98.0 | 100.0 | 100.0 | 90.3 | 99.5 | 98.7 |
stacked-cross-attention-for-image-text | 67.4 | 95.8 | 90.3 | 48.6 | 85.2 | 77.7 |
dynamic-self-adaptive-multiscale-distillation | 82.5 | 97.7 | 95.5 | 68.4 | 94.4 | 90.8 |
learning-relation-alignment-for-calibrated | 88.3 | 99.4 | 98.4 | 76.86 | 95.72 | 93.3 |
multi-grained-vision-language-pre-training | 97.1 | 100.0 | 100.0 | 86.9 | 98.7 | 97.3 |
3shnet-boosting-image-sentence-retrieval-via | 87.1 | 99.2 | 98.2 | 69.5 | 94.7 | 91.0 |
deep-cross-modal-projection-learning-for | 49.6 | 86.1 | 76.8 | 37.3 | 75.5 | 65.7 |
ernie-vil-2-0-multi-view-contrastive-learning | 97.2 | 100.0 | 100.0 | 93.3 | 99.8 | 99.4 |
graph-structured-network-for-image-text | 76.4 | 97.3 | 94.3 | 57.4 | 89.0 | 82.3 |
learning-semantic-concepts-and-order-for | 55.5 | 89.3 | 82.0 | 41.1 | 80.1 | 70.5 |
Model 16 | 97.2 | 100 | 100 | 86.8 | 98.9 | 97.6 |
vista-vision-and-scene-text-aggregation-for | 89.5 | 99.6 | 98.4 | 75.8 | 96.9 | 94.2 |
x-2-vlm-all-in-one-pre-trained-model-for | 98.5 | 100 | 100 | 90.4 | 99.3 | 98.2 |
imram-iterative-matching-with-recurrent | 74.1 | 96.6 | 93.0 | 53.9 | 87.2 | 79.4 |
vilt-vision-and-language-transformer-without | 83.5 | 98.6 | 96.7 | 64.4 | 93.8 | 88.7 |
plug-and-play-regulators-for-image-text | 82.3 | 98.4 | 96.0 | 62.6 | 91.1 | 85.8 |
x-2-vlm-all-in-one-pre-trained-model-for | 98.8 | 100 | 100 | 91.8 | 99.5 | 98.6 |
omnivl-one-foundation-model-for-image | 97.3 | 100 | 99.9 | 87.9 | 99.1 | 97.8 |
vast-a-vision-audio-subtitle-text-omni-1 | - | - | - | 91.0 | 99.5 | 98.5 |
dual-path-convolutional-image-text-embedding | 55.6 | - | 81.9 | - | - | - |
scaling-up-visual-and-vision-language | 95.3 | 100 | 99.8 | 84.9 | 98.6 | 97.4 |
Model 27 | 75.3 | 97.3 | 93.4 | 54.98 | 88.26 | 81.3 |