Zero-Shot Cross-Modal Retrieval on Flickr30K

Evaluation Metrics

Image-to-text R@1

Image-to-text R@10

Image-to-text R@5

Text-to-image R@1

Text-to-image R@10

Text-to-image R@5
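
The R@K (Recall@K) metrics listed above are conventionally the percentage of queries for which at least one correct item appears among the top K retrieved candidates. The following is a minimal sketch of that computation in plain Python, not the evaluation code used by any model on this page. The function name `recall_at_k`, the toy similarity scores, and the two-captions-per-image layout are illustrative assumptions.

```python
from typing import Dict, Sequence, Set


def recall_at_k(
    similarity: Sequence[Sequence[float]],
    relevant: Dict[int, Set[int]],
    k: int,
) -> float:
    """Percentage of queries with at least one relevant item in the top k.

    similarity[q][c] is the score between query q and candidate c.
    relevant[q] is the set of candidate indices that count as correct for q.
    """
    hits = 0
    for q, scores in enumerate(similarity):
        # Rank candidate indices by descending similarity score.
        ranked = sorted(range(len(scores)), key=lambda c: scores[c], reverse=True)
        if relevant[q] & set(ranked[:k]):
            hits += 1
    return 100.0 * hits / len(similarity)


# Toy data (made up): 2 images, 4 captions, 2 captions per image.
image_to_text_scores = [
    [0.9, 0.2, 0.8, 0.1],  # image 0 vs. captions 0..3
    [0.7, 0.3, 0.6, 0.5],  # image 1 vs. captions 0..3
]
captions_of_image = {0: {0, 1}, 1: {2, 3}}

# Text-to-image uses the transposed matrix; each caption has one correct image.
text_to_image_scores = [list(col) for col in zip(*image_to_text_scores)]
image_of_caption = {0: {0}, 1: {0}, 2: {1}, 3: {1}}

i2t_r1 = recall_at_k(image_to_text_scores, captions_of_image, k=1)
t2i_r1 = recall_at_k(text_to_image_scores, image_of_caption, k=1)
print(f"Image-to-text R@1: {i2t_r1:.1f}")
print(f"Text-to-image R@1: {t2i_r1:.1f}")
```

In this sketch, image-to-text retrieval counts a hit when any caption belonging to the query image is retrieved, while text-to-image retrieval has exactly one correct image per caption. In a real evaluation the similarity matrix would come from the model's image and text embeddings.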

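Evaluation Results

Results for each model on this benchmark. All scores are Recall@K in percent; a dash marks a value not listed for that model.

Model	Image-to-text R@1	Image-to-text R@10	Image-to-text R@5	Text-to-image R@1	Text-to-image R@10	Text-to-image R@5	Paper Title	Repository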
COSMOS ViT-B/32	89.9	99.3	98.8	76.1	96.2	92.8	COSMOS: Cross-Modality Self-Distillation for Vision Language Pre-training
OpenCLIP ViT-H/14	-	-	99.3	-	-	94.1	Reproducible scaling laws for contrastive language-image learning
ViLT-B/32	73.2	96.5	93.6	55.0	89.8	82.5	ViLT: Vision-and-Language Transformer Without Convolution or Region Supervision
ERNIE-ViL 2.0	91.2	99.8	99.1	77.4	96.4	93.8	ERNIE-ViL 2.0: Multi-view Contrastive Learning for Image-Text Pre-training
ALBEF	90.5	99.7	98.8	76.8	96.7	93.7	Align before Fuse: Vision and Language Representation Learning with Momentum Distillation
ALIGN	88.6	99.7	98.7	75.7	96.8	93.8	Scaling Up Visual and Vision-Language Representation Learning With Noisy Text Supervision
AltCLIP	86.0	99.1	98.0	72.5	95.4	91.6	AltCLIP: Altering the Language Encoder in CLIP for Extended Language Capabilities
InternVL-G	95.7	99.9	99.7	85.0	98.6	97.0	InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks
COSMOS ViT-B/16	92.9	99.9	99.4	80.3	97.6	95.3	COSMOS: Cross-Modality Self-Distillation for Vision Language Pre-training
VK-OOD	89.0	99.8	99.2	77.2	98.2	94.3	Implicit Differentiable Outlier Detection Enable Robust Deep Multimodal Analysis	-
CoCa	92.5	99.9	99.5	80.4	97.7	95.7	CoCa: Contrastive Captioners are Image-Text Foundation Models
InternVL-C	94.7	99.9	99.6	81.7	98.2	96.0	InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks
BEiT-3	94.9	100.0	99.9	81.5	97.8	95.6	Image as a Foreign Language: BEiT Pretraining for All Vision and Vision-Language Tasks
Flamingo	89.3	99.7	98.8	79.5	97.9	95.3	Flamingo: a Visual Language Model for Few-Shot Learning
PTP-BLIP (14M)	87.1	99.3	98.4	73.1	94.8	91.0	Position-guided Text Prompt for Vision-Language Pre-training
Florence	90.9	-	99.1	76.7	-	93.6	Florence: A New Foundation Model for Computer Vision
M2-Encoder	91.2	99.6	99.2	92.2	99.7	99.5	M2-Encoder: Advancing Bilingual Image-Text Understanding by Large-scale Efficient Pretraining
ImageBERT	70.7	94.0	90.2	54.3	87.5	79.6	ImageBERT: Cross-modal Pre-training with Large-scale Weak-supervised Image-Text Data	-
VAST	-	-	-	90.4	-	-	VAST: A Vision-Audio-Subtitle-Text Omni-Modality Foundation Model and Dataset
UNITER	80.7	98.0	95.7	66.2	92.9	88.4	UNITER: UNiversal Image-TExt Representation Learning
