HyperAI초신경

Zero-Shot Cross-Modal Retrieval on COCO 2014

Evaluation Metrics

Image-to-text R@1

Image-to-text R@10

Image-to-text R@5

Text-to-image R@1

Text-to-image R@10

Text-to-image R@5
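
The page does not define the metrics listed above. As a rough illustration, the sketch below computes Recall@K under the common convention that a query counts as a hit when at least one of its correct matches appears among the K highest-scoring candidates, reported as a percentage. The function name `recall_at_k`, the similarity values, and the ground-truth sets are all hypothetical and are not taken from any of the listed papers.

```python
def recall_at_k(similarity, relevant, k):
    """Percentage of queries with at least one relevant item in the top k.

    similarity: list of rows, one per query, holding a score for every candidate.
    relevant:   list of sets; relevant[i] holds the correct candidate indices
                for query i (an image may have several correct captions).
    """
    hits = 0
    for scores, gold in zip(similarity, relevant):
        # Rank candidate indices by descending score and keep the first k.
        top_k = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)[:k]
        if gold.intersection(top_k):
            hits += 1
    return 100.0 * hits / len(similarity)


# Toy data (made up): 3 image queries scored against 4 captions.
sim = [
    [0.9, 0.1, 0.3, 0.2],  # correct caption 0 is ranked first
    [0.2, 0.4, 0.8, 0.1],  # correct caption 1 is ranked second
    [0.5, 0.6, 0.1, 0.7],  # correct caption 2 is ranked last
]
gold = [{0}, {1}, {2}]

r1 = recall_at_k(sim, gold, 1)
r2 = recall_at_k(sim, gold, 2)
r4 = recall_at_k(sim, gold, 4)
print(f"R@1={r1:.1f}  R@2={r2:.1f}  R@4={r4:.1f}")
```

For text-to-image retrieval the same function would apply with the similarity matrix transposed, so that each row corresponds to a caption query scored against all images.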

Evaluation Results

Performance results of each model on this benchmark

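Model	Image-to-text R@1	Image-to-text R@10	Image-to-text R@5	Text-to-image R@1	Text-to-image R@10	Text-to-image R@5	Paper Title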
InternVL-G	74.9	95.2	91.3	58.6	88.0	81.3	InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks
M2-Encoder	72.8	96.3	92.3	56.5	88.8	81.6	M2-Encoder: Advancing Bilingual Image-Text Understanding by Large-scale Efficient Pretraining
TCL	71.4	95.4	90.8	53.5	87.1	79.0	Vision-Language Pre-Training with Triple Contrastive Learning
InternVL-C	70.6	93.5	89.0	54.1	84.6	77.3	InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks
PTP-BLIP	69.7	94.7	90.0	49.5	84.2	75.9	Position-guided Text Prompt for Vision-Language Pre-training
RO-ViT	68.9	92.2	87.8	51.8	83.0	75.0	Region-Aware Pretraining for Open-Vocabulary Object Detection with Vision Transformers
ALBEF	68.7	94.7	89.5	50.1	84.5	76.4	Align before Fuse: Vision and Language Representation Learning with Momentum Distillation
COSMOS ViT-B/16	68.0	92.5	87.8	52.5	84.9	77.2	COSMOS: Cross-Modality Self-Distillation for Vision Language Pre-training
CoCa	66.3	91.8	86.2	51.2	82.0	74.2	CoCa: Contrastive Captioners are Image-Text Foundation Models
Flamingo	65.9	92.9	87.3	48.0	82.1	73.3	Flamingo: a Visual Language Model for Few-Shot Learning
Florence	64.7	-	85.9	47.2	-	71.4	Florence: A New Foundation Model for Computer Vision
COSMOS ViT-B/32	64.3	92.0	86.5	48.4	82.6	74.2	COSMOS: Cross-Modality Self-Distillation for Vision Language Pre-training
ERNIE-ViL 2.0	63.1	91.4	85.7	46.0	80.4	71.4	ERNIE-ViL 2.0: Multi-view Contrastive Learning for Image-Text Pre-training
ALIGN	58.6	89.7	83.0	45.6	78.6	69.8	Scaling Up Visual and Vision-Language Representation Learning With Noisy Text Supervision
CLIP	58.4	88.1	81.5	37.8	72.2	62.4	Learning Transferable Visual Models From Natural Language Supervision
ViLT-B/32	56.5	89.6	82.6	40.4	81.1	70.0	ViLT: Vision-and-Language Transformer Without Convolution or Region Supervision
ImageBERT	44.0	80.4	71.2	32.3	70.2	59.0	ImageBERT: Cross-modal Pre-training with Large-scale Weak-supervised Image-Text Data

