Cross Modal Retrieval On Flickr30K
Evaluation Metrics

Image-to-text R@1 / R@5 / R@10
Text-to-image R@1 / R@5 / R@10

Recall@K (R@K) is the percentage of queries for which a ground-truth match appears among the top K retrieved candidates.
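As a concrete illustration, R@K can be computed from a cross-modal similarity matrix. The sketch below is a minimal example, not the benchmark's official evaluation code; the embeddings are synthetic placeholders, and it assumes a simplified one-to-one ground truth (on Flickr30K each image actually has five reference captions, and image-to-text retrieval counts a hit if any of the five ranks in the top K).

```python
import numpy as np

def recall_at_k(sim, ks=(1, 5, 10)):
    """Compute R@K given sim[i, j] = similarity of query i to candidate j.

    Simplifying assumption: the ground-truth candidate for query i is
    candidate i (one matching pair per index).
    """
    n = sim.shape[0]
    order = np.argsort(-sim, axis=1)                          # candidates ranked best-first per query
    ranks = (order == np.arange(n)[:, None]).argmax(axis=1)   # rank position of each ground truth
    return {f"R@{k}": 100.0 * float((ranks < k).mean()) for k in ks}

# Toy example with random, hypothetical embeddings.
rng = np.random.default_rng(0)
img = rng.normal(size=(100, 64))                    # 100 image embeddings
txt = rng.normal(size=(100, 64))                    # 100 paired text embeddings
img /= np.linalg.norm(img, axis=1, keepdims=True)   # L2-normalize so dot product = cosine
txt /= np.linalg.norm(txt, axis=1, keepdims=True)
sim = img @ txt.T                                   # cosine similarity matrix

print(recall_at_k(sim))    # image-to-text R@1 / R@5 / R@10
print(recall_at_k(sim.T))  # text-to-image R@1 / R@5 / R@10
```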
Evaluation Results

Performance of each model on this benchmark:
| Model | Image-to-text R@1 | Image-to-text R@5 | Image-to-text R@10 | Text-to-image R@1 | Text-to-image R@5 | Text-to-image R@10 | Paper Title |
|---|---|---|---|---|---|---|---|
| VSE++ (ResNet) | 52.9 | 80.5 | 87.2 | 39.6 | 70.1 | 79.5 | VSE++: Improving Visual-Semantic Embeddings with Hard Negatives |
| Dual-Path (ResNet) | - | - | 89.5 | 39.1 | 69.2 | 80.9 | Dual-Path Convolutional Image-Text Embeddings with Instance Loss |
| VSE-Gradient | 97.0 | 99.6 | 100 | 86.3 | 97.4 | 99.0 | Dissecting Deep Metric Learning Losses for Image-Text Retrieval |
| SGRAF | 77.8 | 94.1 | 97.4 | 58.5 | 83.0 | 88.8 | Similarity Reasoning and Filtration for Image-Text Matching |
| NAPReg | 79.6 | - | - | 60.0 | - | - | NAPReg: Nouns As Proxies Regularization for Semantically Aware Cross-Modal Embeddings |
| BEiT-3 | 98.0 | 100.0 | 100.0 | 90.3 | 98.7 | 99.5 | Image as a Foreign Language: BEiT Pretraining for All Vision and Vision-Language Tasks |
| SCAN | 67.4 | 90.3 | 95.8 | 48.6 | 77.7 | 85.2 | Stacked Cross Attention for Image-Text Matching |
| DSMD | 82.5 | 95.5 | 97.7 | 68.4 | 90.8 | 94.4 | Dynamic Self-adaptive Multiscale Distillation from Pre-trained Multimodal Large Model for Efficient Cross-modal Representation Learning |
| IAIS | 88.3 | 98.4 | 99.4 | 76.86 | 93.3 | 95.72 | Learning Relation Alignment for Calibrated Cross-modal Retrieval |
| X-VLM (base) | 97.1 | 100.0 | 100.0 | 86.9 | 97.3 | 98.7 | Multi-Grained Vision Language Pre-Training: Aligning Texts with Visual Concepts |
| 3SHNet | 87.1 | 98.2 | 99.2 | 69.5 | 91.0 | 94.7 | 3SHNet: Boosting Image-Sentence Retrieval via Visual Semantic-Spatial Self-Highlighting |
| CMPL (ResNet) | 49.6 | 76.8 | 86.1 | 37.3 | 65.7 | 75.5 | Deep Cross-Modal Projection Learning for Image-Text Matching |
| ERNIE-ViL 2.0 | 97.2 | 100.0 | 100.0 | 93.3 | 99.4 | 99.8 | ERNIE-ViL 2.0: Multi-view Contrastive Learning for Image-Text Pre-training |
| GSMN | 76.4 | 94.3 | 97.3 | 57.4 | 82.3 | 89.0 | Graph Structured Network for Image-Text Matching |
| SCO (ResNet) | 55.5 | 82.0 | 89.3 | 41.1 | 70.5 | 80.1 | Learning Semantic Concepts and Order for Image and Sentence Matching |
| Aurora (ours, r=128) | 97.2 | 100 | 100 | 86.8 | 97.6 | 98.9 | - |
| ViSTA | 89.5 | 98.4 | 99.6 | 75.8 | 94.2 | 96.9 | ViSTA: Vision and Scene Text Aggregation for Cross-Modal Retrieval |
| X2-VLM (base) | 98.5 | 100 | 100 | 90.4 | 98.2 | 99.3 | X$^2$-VLM: All-In-One Pre-trained Model For Vision-Language Tasks |
| IMRAM | 74.1 | 93.0 | 96.6 | 53.9 | 79.4 | 87.2 | IMRAM: Iterative Matching with Recurrent Attention Memory for Cross-Modal Image-Text Retrieval |
| ViLT-B/32 | 83.5 | 96.7 | 98.6 | 64.4 | 88.7 | 93.8 | ViLT: Vision-and-Language Transformer Without Convolution or Region Supervision |