HyperAI超神经

Cross Modal Retrieval On Flickr30K

Evaluation Metrics

Image-to-text R@1
Image-to-text R@10
Image-to-text R@5
Text-to-image R@1
Text-to-image R@10
Text-to-image R@5
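
For reference, R@K (Recall at K) is commonly computed as the percentage of queries for which at least one correct match appears among the top K retrieved items. The Python sketch below illustrates that definition on toy data. The function name `recall_at_k`, the similarity scores, and the ground-truth sets are all hypothetical; they are not taken from this page or from the evaluation code of any listed model.

```python
import numpy as np

def recall_at_k(sim, gt, k):
    """Percentage of queries with at least one correct candidate in the top k.

    sim: (n_queries, n_candidates) array of similarity scores, higher is better.
    gt:  list of sets; gt[i] holds the candidate indices that are correct for query i.
    """
    # Indices of the k highest-scoring candidates for each query
    topk = np.argsort(-sim, axis=1)[:, :k]
    hits = [bool(gt[i] & set(topk[i].tolist())) for i in range(sim.shape[0])]
    return 100.0 * float(np.mean(hits))

# Toy data (assumed): 2 images, 4 captions, 2 captions per image
sim_i2t = np.array([
    [0.5, 0.2, 0.8, 0.1],  # image 0; captions 0 and 1 are correct
    [0.3, 0.4, 0.7, 0.6],  # image 1; captions 2 and 3 are correct
])
gt_i2t = [{0, 1}, {2, 3}]

# Text-to-image reuses the same scores, transposed: one correct image per caption
sim_t2i = sim_i2t.T
gt_t2i = [{0}, {0}, {1}, {1}]

print("Image-to-text R@1:", recall_at_k(sim_i2t, gt_i2t, 1))
print("Image-to-text R@2:", recall_at_k(sim_i2t, gt_i2t, 2))
print("Text-to-image R@1:", recall_at_k(sim_t2i, gt_t2i, 1))
```

In this sketch an image-to-text query counts as a hit if any of its paired captions is retrieved, while each text-to-image query has exactly one correct image. Individual papers may differ in test split and protocol details.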

Evaluation Results

Performance results of each model on this benchmark

Comparison Table
| Model Name | Image-to-text R@1 | Image-to-text R@10 | Image-to-text R@5 | Text-to-image R@1 | Text-to-image R@10 | Text-to-image R@5 |
|---|---|---|---|---|---|---|
| vse-improving-visual-semantic-embeddings-with | 52.9 | 87.2 | 80.5 | 39.6 | 79.5 | 70.1 |
| dual-path-convolutional-image-text-embedding | - | 89.5 | - | 39.1 | 80.9 | 69.2 |
| dissecting-deep-metric-learning-losses-for | 97.0 | 100 | 99.6 | 86.3 | 99.0 | 97.4 |
| similarity-reasoning-and-filtration-for-image | 77.8 | 97.4 | 94.1 | 58.5 | 88.8 | 83.0 |
| napreg-nouns-as-proxies-regularization-for | 79.6 | - | - | 60.0 | - | - |
| image-as-a-foreign-language-beit-pretraining | 98.0 | 100.0 | 100.0 | 90.3 | 99.5 | 98.7 |
| stacked-cross-attention-for-image-text | 67.4 | 95.8 | 90.3 | 48.6 | 85.2 | 77.7 |
| dynamic-self-adaptive-multiscale-distillation | 82.5 | 97.7 | 95.5 | 68.4 | 94.4 | 90.8 |
| learning-relation-alignment-for-calibrated | 88.3 | 99.4 | 98.4 | 76.86 | 95.72 | 93.3 |
| multi-grained-vision-language-pre-training | 97.1 | 100.0 | 100.0 | 86.9 | 98.7 | 97.3 |
| 3shnet-boosting-image-sentence-retrieval-via | 87.1 | 99.2 | 98.2 | 69.5 | 94.7 | 91.0 |
| deep-cross-modal-projection-learning-for | 49.6 | 86.1 | 76.8 | 37.3 | 75.5 | 65.7 |
| ernie-vil-2-0-multi-view-contrastive-learning | 97.2 | 100.0 | 100.0 | 93.3 | 99.8 | 99.4 |
| graph-structured-network-for-image-text | 76.4 | 97.3 | 94.3 | 57.4 | 89.0 | 82.3 |
| learning-semantic-concepts-and-order-for | 55.5 | 89.3 | 82.0 | 41.1 | 80.1 | 70.5 |
| Model 16 | 97.2 | 100 | 100 | 86.8 | 98.9 | 97.6 |
| vista-vision-and-scene-text-aggregation-for | 89.5 | 99.6 | 98.4 | 75.8 | 96.9 | 94.2 |
| x-2-vlm-all-in-one-pre-trained-model-for | 98.5 | 100 | 100 | 90.4 | 99.3 | 98.2 |
| imram-iterative-matching-with-recurrent | 74.1 | 96.6 | 93.0 | 53.9 | 87.2 | 79.4 |
| vilt-vision-and-language-transformer-without | 83.5 | 98.6 | 96.7 | 64.4 | 93.8 | 88.7 |
| plug-and-play-regulators-for-image-text | 82.3 | 98.4 | 96.0 | 62.6 | 91.1 | 85.8 |
| x-2-vlm-all-in-one-pre-trained-model-for | 98.8 | 100 | 100 | 91.8 | 99.5 | 98.6 |
| omnivl-one-foundation-model-for-image | 97.3 | 100 | 99.9 | 87.9 | 99.1 | 97.8 |
| vast-a-vision-audio-subtitle-text-omni-1 | - | - | - | 91.0 | 99.5 | 98.5 |
| dual-path-convolutional-image-text-embedding | 55.6 | - | 81.9 | - | - | - |
| scaling-up-visual-and-vision-language | 95.3 | 100 | 99.8 | 84.9 | 98.6 | 97.4 |
| Model 27 | 75.3 | 97.3 | 93.4 | 54.98 | 88.26 | 81.3 |