Zero-Shot Cross-Modal Retrieval on COCO 2014
Evaluation Metrics
Image-to-text R@1
Image-to-text R@10
Image-to-text R@5
Text-to-image R@1
Text-to-image R@10
Text-to-image R@5
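R@K (Recall at K) is conventionally the percentage of queries for which at least one correct match appears among the K highest-scoring candidates. For image-to-text the queries are images and the candidates are captions; for text-to-image the roles are swapped. The following is a minimal sketch of that computation, not the evaluation code used by any model listed here. The function name `recall_at_k` and the toy scores are hypothetical and exist only for illustration.

```python
from typing import List, Sequence, Set


def recall_at_k(scores: Sequence[Sequence[float]],
                relevant: List[Set[int]],
                k: int) -> float:
    """Return Recall@K as a percentage.

    scores[i][j] is the similarity between query i and candidate j.
    relevant[i] is the set of candidate indices that are correct for query i
    (an image may have several correct captions, for example).
    """
    hits = 0
    for row, truth in zip(scores, relevant):
        # Rank candidate indices from highest to lowest similarity.
        ranked = sorted(range(len(row)), key=lambda j: row[j], reverse=True)
        # A query counts as a hit if any correct candidate is in the top k.
        if truth.intersection(ranked[:k]):
            hits += 1
    return 100.0 * hits / len(scores)


# Toy data (hypothetical): 3 queries scored against 4 candidates.
toy_scores = [
    [0.9, 0.1, 0.3, 0.2],
    [0.2, 0.4, 0.8, 0.1],
    [0.5, 0.6, 0.1, 0.7],
]
toy_relevant = [{0}, {1}, {0, 2}]

for k in (1, 2, 3):
    print(f"R@{k}: {recall_at_k(toy_scores, toy_relevant, k):.1f}")
```

Because a hit at a smaller K is also a hit at every larger K, R@1 ≤ R@5 ≤ R@10 holds for any single model, which is a quick sanity check on the rows of the table.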
Evaluation Results
Performance of each model on this benchmark
Comparison Table
Model Name | Image-to-text R@1 | Image-to-text R@10 | Image-to-text R@5 | Text-to-image R@1 | Text-to-image R@10 | Text-to-image R@5 |
---|---|---|---|---|---|---|
coca-contrastive-captioners-are-image-text | 66.3 | 91.8 | 86.2 | 51.2 | 82.0 | 74.2 |
learning-transferable-visual-models-from | 58.4 | 88.1 | 81.5 | 37.8 | 72.2 | 62.4 |
vilt-vision-and-language-transformer-without | 56.5 | 89.6 | 82.6 | 40.4 | 81.1 | 70.0 |
ernie-vil-2-0-multi-view-contrastive-learning | 63.1 | 91.4 | 85.7 | 46.0 | 80.4 | 71.4 |
align-before-fuse-vision-and-language | 68.7 | 94.7 | 89.5 | 50.1 | 84.5 | 76.4 |
cosmos-cross-modality-self-distillation-for | 64.3 | 92.0 | 86.5 | 48.4 | 82.6 | 74.2 |
position-guided-text-prompt-for-vision | 69.7 | 94.7 | 90.0 | 49.5 | 84.2 | 75.9 |
cosmos-cross-modality-self-distillation-for | 68.0 | 92.5 | 87.8 | 52.5 | 84.9 | 77.2 |
imagebert-cross-modal-pre-training-with-large | 44.0 | 80.4 | 71.2 | 32.3 | 70.2 | 59.0 |
internvl-scaling-up-vision-foundation-models | 70.6 | 93.5 | 89.0 | 54.1 | 84.6 | 77.3 |
flamingo-a-visual-language-model-for-few-shot-1 | 65.9 | 92.9 | 87.3 | 48.0 | 82.1 | 73.3 |
internvl-scaling-up-vision-foundation-models | 74.9 | 95.2 | 91.3 | 58.6 | 88.0 | 81.3 |
scaling-up-visual-and-vision-language | 58.6 | 89.7 | 83.0 | 45.6 | 78.6 | 69.8 |
Model 14 | 0 | 0 | 0 | 0 | 0 | 0 |
florence-a-new-foundation-model-for-computer | 64.7 | - | 85.9 | 47.2 | - | 71.4 |
boldsymbol-m-2-encoder-advancing-bilingual | 72.8 | 96.3 | 92.3 | 56.5 | 88.8 | 81.6 |
region-aware-pretraining-for-open-vocabulary | 68.9 | 92.2 | 87.8 | 51.8 | 83.0 | 75.0 |
vision-language-pre-training-with-triple | 71.4 | 95.4 | 90.8 | 53.5 | 87.1 | 79.0 |