HyperAI超神经

Image Retrieval on CREPE (Vision-Language)

Evaluation Metrics

Recall@1 (HN-Atom + HN-Comp, SC)
Recall@1 (HN-Atom + HN-Comp, UC)
Recall@1 (HN-Atom, UC)
Recall@1 (HN-Comp, UC)
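
All four metrics are Recall@1 scores, reported as percentages: the share of queries for which the top-ranked candidate is the ground-truth match. The snippet below is a minimal sketch of that computation, not the benchmark's official evaluation code. The function names, the `scores` layout (one row of similarity scores per query), and the `targets` list are assumptions made for illustration. The `chance_recall_at_1` helper shows the score expected from uniform guessing among a fixed number of candidates; the candidate counts used in the usage lines (11, 5, and 7) are inferred from the Random baseline values on this page and are not stated here.

```python
def recall_at_1(scores, targets):
    """Return Recall@1 as a percentage.

    scores  -- list of rows, one per query; each row holds the similarity
               score of every candidate for that query (hypothetical layout)
    targets -- list with the index of the correct candidate for each query
    """
    hits = 0
    for row, target in zip(scores, targets):
        # Index of the highest-scoring candidate for this query.
        best = max(range(len(row)), key=row.__getitem__)
        hits += int(best == target)
    return 100.0 * hits / len(targets)


def chance_recall_at_1(num_candidates):
    """Recall@1 (percent) expected from uniform random guessing."""
    return 100.0 / num_candidates


# Toy example: 4 queries, 3 candidates each; 3 of the 4 are ranked correctly.
toy_scores = [
    [0.9, 0.1, 0.0],
    [0.2, 0.3, 0.5],
    [0.1, 0.8, 0.1],
    [0.4, 0.3, 0.3],
]
toy_targets = [0, 2, 0, 0]
print(recall_at_1(toy_scores, toy_targets))

# Chance level for the assumed candidate counts.
for k in (11, 5, 7):
    print(k, round(chance_recall_at_1(k), 2))
```

A chance-level reference of this kind helps when reading the results: a model scoring near the Random row on a given split is ranking the hard negatives about as well as guessing would.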

Evaluation Results

Performance of each model on this benchmark

| Model | Recall@1 (HN-Atom + HN-Comp, SC) | Recall@1 (HN-Atom + HN-Comp, UC) | Recall@1 (HN-Atom, UC) | Recall@1 (HN-Comp, UC) | Paper Title | Repository |
|---|---|---|---|---|---|---|
| ViT-B-16 (LAION400M) | 37.01 | 30.81 | 44.93 | 59.00 | CREPE: Can Vision-Language Foundation Models Reason Compositionally? | |
| Swin-T (CLIP, CC-12M) | - | - | 37.3 | 44.1 | Coarse-to-Fine Contrastive Learning in Image-Text-Graph Space for Improved Vision-Language Compositionality | - |
| RN50 (CC12M) | 23.26 | 19.96 | 34.88 | 45.27 | CREPE: Can Vision-Language Foundation Models Reason Compositionally? | |
| ViT-L-14 (LAION400M) | 39.44 | 33.81 | 47.86 | 60.78 | CREPE: Can Vision-Language Foundation Models Reason Compositionally? | |
| RN-50 (CLIP, CC-12M) | - | - | 36.7 | 42.9 | Coarse-to-Fine Contrastive Learning in Image-Text-Graph Space for Improved Vision-Language Compositionality | - |
| MosaiCLIP (CC-FT) | - | - | 40.9 | 72.4 | Coarse-to-Fine Contrastive Learning in Image-Text-Graph Space for Improved Vision-Language Compositionality | - |
| NegCLIP (YFCC-FT) | - | - | 39.0 | 38.8 | Coarse-to-Fine Contrastive Learning in Image-Text-Graph Space for Improved Vision-Language Compositionality | - |
| ViT-B-32 (LAION400M) | 34.28 | 28.00 | 42.75 | 54.80 | CREPE: Can Vision-Language Foundation Models Reason Compositionally? | |
| CLIP-FT (YFCC-FT) | - | - | 38.3 | 36.4 | Coarse-to-Fine Contrastive Learning in Image-Text-Graph Space for Improved Vision-Language Compositionality | - |
| RN101 (YFCC15M) | 22.74 | 20.50 | 39.50 | 39.56 | CREPE: Can Vision-Language Foundation Models Reason Compositionally? | |
| ViT-B-16+240 (LAION400M) | 37.32 | 32.26 | 46.53 | 60.19 | CREPE: Can Vision-Language Foundation Models Reason Compositionally? | |
| CLIP-FT (CC-FT) | - | - | 35.6 | 45.8 | Coarse-to-Fine Contrastive Learning in Image-Text-Graph Space for Improved Vision-Language Compositionality | - |
| CLIP (YFCC-FT) | - | - | 39.5 | 39.8 | Coarse-to-Fine Contrastive Learning in Image-Text-Graph Space for Improved Vision-Language Compositionality | - |
| CLIP (CC-FT) | - | - | 35.0 | 45.1 | Coarse-to-Fine Contrastive Learning in Image-Text-Graph Space for Improved Vision-Language Compositionality | - |
| Random | 9.09 | 9.09 | 20.00 | 14.29 | CREPE: Can Vision-Language Foundation Models Reason Compositionally? | |
| RN-50 (NegCLIP, CC-12M) | - | - | 41.4 | 82.0 | Coarse-to-Fine Contrastive Learning in Image-Text-Graph Space for Improved Vision-Language Compositionality | - |
| RN-50 (MosaiCLIP, CC-12M) | - | - | 44.4 | 92.6 | Coarse-to-Fine Contrastive Learning in Image-Text-Graph Space for Improved Vision-Language Compositionality | - |
| NegCLIP (CC-FT) | - | - | 37.5 | 53.1 | Coarse-to-Fine Contrastive Learning in Image-Text-Graph Space for Improved Vision-Language Compositionality | - |
| Swin-T (MosaiCLIP, CC-12M) | - | - | 44.5 | 92.1 | Coarse-to-Fine Contrastive Learning in Image-Text-Graph Space for Improved Vision-Language Compositionality | - |
| Swin-T (NegCLIP, CC-12M) | - | - | 39.6 | 80.3 | Coarse-to-Fine Contrastive Learning in Image-Text-Graph Space for Improved Vision-Language Compositionality | - |