HyperAI

Zero Shot Cross Modal Retrieval On Flickr30K

Metrics

Image-to-text R@1
Image-to-text R@10
Image-to-text R@5
Text-to-image R@1
Text-to-image R@10
Text-to-image R@5
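
The page does not define R@K. Under the usual reading, Recall@K is the percentage of queries for which at least one correct match appears among the top K retrieved candidates. The snippet below is a minimal sketch of that definition. The function name `recall_at_k`, the score matrix, and the toy data are illustrative assumptions and are not taken from any of the listed models.

```python
def recall_at_k(scores, relevant, k):
    """Percentage of queries with at least one relevant candidate in the top k.

    scores[q][c] is the similarity of query q to candidate c (hypothetical input);
    relevant[q] is the set of candidate indices counted as correct for query q.
    """
    hits = 0
    for query_scores, gold in zip(scores, relevant):
        # Rank candidate indices by descending similarity.
        ranked = sorted(
            range(len(query_scores)),
            key=lambda c: query_scores[c],
            reverse=True,
        )
        # A query counts as a hit if any correct candidate is in the top k.
        if gold & set(ranked[:k]):
            hits += 1
    return 100.0 * hits / len(scores)


# Toy image-to-text example: 2 images (queries) scored against 4 captions.
image_to_text = [
    [0.9, 0.2, 0.1, 0.3],
    [0.8, 0.1, 0.7, 0.6],
]
# Assumed ground truth: captions 0-1 describe image 0, captions 2-3 describe image 1.
gold_captions = [{0, 1}, {2, 3}]

r_at_1 = recall_at_k(image_to_text, gold_captions, 1)
r_at_2 = recall_at_k(image_to_text, gold_captions, 2)
```

For text-to-image retrieval the same function applies with the roles swapped: each caption is a query and its single source image is the only relevant candidate.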

Results

Performance results of various models on this benchmark

Comparison table

| Model name | Image-to-text R@1 | Image-to-text R@10 | Image-to-text R@5 | Text-to-image R@1 | Text-to-image R@10 | Text-to-image R@5 |
|---|---|---|---|---|---|---|
| cosmos-cross-modality-self-distillation-for | 89.9 | 99.3 | 98.8 | 76.1 | 96.2 | 92.8 |
| reproducible-scaling-laws-for-contrastive | - | - | 99.3 | - | - | 94.1 |
| vilt-vision-and-language-transformer-without | 73.2 | 96.5 | 93.6 | 55 | 89.8 | 82.5 |
| ernie-vil-2-0-multi-view-contrastive-learning | 91.2 | 99.8 | 99.1 | 77.4 | 96.4 | 93.8 |
| align-before-fuse-vision-and-language | 90.5 | 99.7 | 98.8 | 76.8 | 96.7 | 93.7 |
| scaling-up-visual-and-vision-language | 88.6 | 99.7 | 98.7 | 75.7 | 96.8 | 93.8 |
| altclip-altering-the-language-encoder-in-clip | 86 | 99.1 | 98 | 72.5 | 95.4 | 91.6 |
| internvl-scaling-up-vision-foundation-models | 95.7 | 99.9 | 99.7 | 85.0 | 98.6 | 97.0 |
| cosmos-cross-modality-self-distillation-for | 92.9 | 99.9 | 99.4 | 80.3 | 97.6 | 95.3 |
| implicit-differentiable-outlier-detection | 89.0 | 99.8 | 99.2 | 77.2 | 98.2 | 94.3 |
| coca-contrastive-captioners-are-image-text | 92.5 | 99.9 | 99.5 | 80.4 | 97.7 | 95.7 |
| internvl-scaling-up-vision-foundation-models | 94.7 | 99.9 | 99.6 | 81.7 | 98.2 | 96.0 |
| image-as-a-foreign-language-beit-pretraining | 94.9 | 100.0 | 99.9 | 81.5 | 97.8 | 95.6 |
| flamingo-a-visual-language-model-for-few-shot-1 | 89.3 | 99.7 | 98.8 | 79.5 | 97.9 | 95.3 |
| position-guided-text-prompt-for-vision | 87.1 | 99.3 | 98.4 | 73.1 | 94.8 | 91.0 |
| florence-a-new-foundation-model-for-computer | 90.9 | - | 99.1 | 76.7 | - | 93.6 |
| boldsymbol-m-2-encoder-advancing-bilingual | 91.2 | 99.6 | 99.2 | 92.2 | 99.7 | 99.5 |
| imagebert-cross-modal-pre-training-with-large | 70.7 | 94.0 | 90.2 | 54.3 | 87.5 | 79.6 |
| vast-a-vision-audio-subtitle-text-omni-1 | - | - | - | 90.4 | - | - |
| uniter-learning-universal-image-text-1 | 80.7 | 98.0 | 95.7 | 66.2 | 92.9 | 88.4 |
| learning-transferable-visual-models-from | 88.0 | 99.4 | 98.7 | 68.7 | 95.2 | 90.6 |
| region-aware-pretraining-for-open-vocabulary | 92.1 | 99.7 | 99.4 | 80.7 | 97.7 | 96.1 |