HyperAI

Cross-Modal Retrieval on COCO 2014

Metrics

Image-to-text R@1
Image-to-text R@10
Image-to-text R@5
Text-to-image R@1
Text-to-image R@10
Text-to-image R@5
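The six metrics above are all Recall@K: the fraction of queries whose ground-truth match appears among the top-K retrieved items, reported in both directions (image→text and text→image). As a minimal sketch (assuming a one-to-one image–caption pairing, whereas the actual COCO evaluation has five captions per image), Recall@K can be computed from a query-by-candidate similarity matrix like this:

```python
import numpy as np

def recall_at_k(sim, k):
    """Recall@K: fraction of queries whose ground-truth match
    (assumed to lie on the diagonal) ranks in the top-K retrievals."""
    ranks = np.argsort(-sim, axis=1)  # candidate indices, descending similarity
    gt = np.arange(sim.shape[0]).reshape(-1, 1)  # ground-truth index per query
    return float(np.mean((ranks[:, :k] == gt).any(axis=1)))

# Toy similarity matrix: rows = image queries, columns = captions.
sim = np.array([[0.9, 0.1, 0.3],
                [0.2, 0.4, 0.8],   # caption 1 ranks 2nd for image 1
                [0.1, 0.2, 0.6]])

print(recall_at_k(sim, 1))  # 2 of 3 ground truths rank first -> 0.666...
print(recall_at_k(sim, 2))  # all ground truths within top-2  -> 1.0
```

Image-to-text scores use the image-by-caption similarity matrix as-is; text-to-image scores come from its transpose, which is why the two directions differ in the table below.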

Results

Performance results of various models on this benchmark

| Model | Image-to-text R@1 | Image-to-text R@10 | Image-to-text R@5 | Text-to-image R@1 | Text-to-image R@10 | Text-to-image R@5 | Paper |
|---|---|---|---|---|---|---|---|
| PVSE | 45.2 | 84.5 | 74.3 | 32.4 | 75.0 | 63.0 | Polysemous Visual-Semantic Embedding for Cross-Modal Retrieval |
| 3SHNet | 67.9 | 95.4 | 90.5 | 50.3 | 87.7 | 79.3 | 3SHNet: Boosting Image-Sentence Retrieval via Visual Semantic-Spatial Self-Highlighting |
| XFM (base) | 84.2 | 98.4 | 96.4 | 67.0 | 92.4 | 87.2 | Toward Building General Foundation Models for Language, Vision, and Vision-Language Understanding Tasks |
| METER | 76.16 | 96.82 | 93.16 | 57.08 | 90.07 | 82.66 | An Empirical Study of Training End-to-End Vision-and-Language Transformers |
| SGRAF | 57.8 | 91.6 | 84.9 | 41.9 | 81.3 | 70.7 | Similarity Reasoning and Filtration for Image-Text Matching |
| LILE | 55.6 | 91.0 | 82.4 | 41.5 | 82.2 | 72.1 | LILE: Look In-Depth before Looking Elsewhere -- A Dual Attention Network using Transformers for Cross-Modal Information Retrieval in Histopathology Archives |
| ViLT-B/32 | 61.5 | 92.7 | 86.3 | 42.7 | 83.1 | 72.9 | ViLT: Vision-and-Language Transformer Without Convolution or Region Supervision |
| PTP-BLIP (14M) | 81.5 | 97.9 | 95.9 | 64.9 | 92.2 | 87.4 | Position-guided Text Prompt for Vision-Language Pre-training |
| Dual-Path (ResNet) | 41.2 | 81.1 | 70.5 | 25.3 | 66.4 | 53.4 | Deep Visual-Semantic Alignments for Generating Image Descriptions |
| VSE-Gradient | 81.4 | 97.9 | 95.6 | 63.6 | 91.5 | 86.0 | Dissecting Deep Metric Learning Losses for Image-Text Retrieval |
| IMRAM | 53.7 | 91.0 | 83.2 | 39.7 | 79.8 | 69.1 | IMRAM: Iterative Matching with Recurrent Attention Memory for Cross-Modal Image-Text Retrieval |
| VSRN | 53.0 | 89.4 | 81.1 | 40.5 | 81.1 | 70.6 | Visual Semantic Reasoning for Image-Text Matching |
| Florence | 81.8 | - | 95.2 | 63.2 | - | 85.7 | Florence: A New Foundation Model for Computer Vision |
| DSMD | 48.0 | 84.5 | 75.6 | 62.1 | 92.0 | 85.9 | Dynamic Self-adaptive Multiscale Distillation from Pre-trained Multimodal Large Model for Efficient Cross-modal Representation Learning |
| ALBEF | 77.6 | 97.2 | 94.3 | 60.7 | 90.5 | 84.3 | Align before Fuse: Vision and Language Representation Learning with Momentum Distillation |
| X2-VLM (base) | 83.5 | 98.5 | 96.3 | 66.2 | 92.2 | 87.1 | X$^2$-VLM: All-In-One Pre-trained Model For Vision-Language Tasks |
| OmniVL (14M) | 82.1 | 98.1 | 95.9 | 64.8 | 91.6 | 86.1 | OmniVL: One Foundation Model for Image-Language and Video-Language Tasks |
| X-VLM (base) | 81.2 | 98.2 | 95.6 | 63.4 | 91.5 | 85.8 | Multi-Grained Vision Language Pre-Training: Aligning Texts with Visual Concepts |
| X2-VLM (large) | 84.4 | 98.5 | 96.5 | 67.7 | 92.5 | 87.5 | X$^2$-VLM: All-In-One Pre-trained Model For Vision-Language Tasks |
| ERNIE-ViL 2.0 | 77.4 | 97.1 | 93.6 | 59.5 | 90.1 | 83.4 | ERNIE-ViL 2.0: Multi-view Contrastive Learning for Image-Text Pre-training |