HyperAI초신경
Zero Shot Cross Modal Retrieval On Coco 2014
Evaluation Metrics

Image-to-text R@1, R@5, R@10
Text-to-image R@1, R@5, R@10
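As a quick illustration of how the metrics above are computed (a minimal sketch, not the benchmark's official evaluation code): Recall@K is the fraction of queries whose ground-truth match appears among the top-K retrieved candidates. The sketch below assumes a square similarity matrix whose diagonal holds the correct image-text pairs; COCO's real protocol pairs each image with five captions, so actual evaluation indexes the matches differently.

```python
import numpy as np

def recall_at_k(sim, k):
    """Recall@K: fraction of queries whose correct match (assumed to lie
    on the diagonal of `sim`) appears among the top-K ranked candidates."""
    # Rank candidates for each query by descending similarity.
    ranks = np.argsort(-sim, axis=1)
    correct = np.arange(sim.shape[0])
    # A query is a hit if its matching index is within the first K ranks.
    hits = (ranks[:, :k] == correct[:, None]).any(axis=1)
    return hits.mean()

# Toy 3x3 image-to-text similarity matrix (rows: images, cols: captions).
sim = np.array([
    [0.9, 0.1, 0.2],   # correct caption ranked 1st
    [0.3, 0.2, 0.8],   # correct caption ranked 2nd
    [0.1, 0.7, 0.6],   # correct caption ranked 3rd
])
print(recall_at_k(sim, 1))  # → 0.3333... (1 of 3 hits at rank 1)
```

Swapping `sim` for its transpose gives the text-to-image direction, which is why the two directions report different numbers in the table below.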
Evaluation Results

Performance results of each model on this benchmark.
| Model | Image-to-text R@1 | Image-to-text R@5 | Image-to-text R@10 | Text-to-image R@1 | Text-to-image R@5 | Text-to-image R@10 | Paper Title |
|---|---|---|---|---|---|---|---|
| CoCa | 66.3 | 86.2 | 91.8 | 51.2 | 74.2 | 82.0 | CoCa: Contrastive Captioners are Image-Text Foundation Models |
| CLIP | 58.4 | 81.5 | 88.1 | 37.8 | 62.4 | 72.2 | Learning Transferable Visual Models From Natural Language Supervision |
| ViLT-B/32 | 56.5 | 82.6 | 89.6 | 40.4 | 70.0 | 81.1 | ViLT: Vision-and-Language Transformer Without Convolution or Region Supervision |
| ERNIE-ViL 2.0 | 63.1 | 85.7 | 91.4 | 46.0 | 71.4 | 80.4 | ERNIE-ViL 2.0: Multi-view Contrastive Learning for Image-Text Pre-training |
| ALBEF | 68.7 | 89.5 | 94.7 | 50.1 | 76.4 | 84.5 | Align before Fuse: Vision and Language Representation Learning with Momentum Distillation |
| COSMOS ViT-B/32 | 64.3 | 86.5 | 92.0 | 48.4 | 74.2 | 82.6 | COSMOS: Cross-Modality Self-Distillation for Vision Language Pre-training |
| PTP-BLIP | 69.7 | 90.0 | 94.7 | 49.5 | 75.9 | 84.2 | Position-guided Text Prompt for Vision-Language Pre-training |
| COSMOS ViT-B/16 | 68.0 | 87.8 | 92.5 | 52.5 | 77.2 | 84.9 | COSMOS: Cross-Modality Self-Distillation for Vision Language Pre-training |
| ImageBERT | 44.0 | 71.2 | 80.4 | 32.3 | 59.0 | 70.2 | ImageBERT: Cross-modal Pre-training with Large-scale Weak-supervised Image-Text Data |
| InternVL-C | 70.6 | 89.0 | 93.5 | 54.1 | 77.3 | 84.6 | InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks |
| Flamingo | 65.9 | 87.3 | 92.9 | 48.0 | 73.3 | 82.1 | Flamingo: a Visual Language Model for Few-Shot Learning |
| InternVL-G | 74.9 | 91.3 | 95.2 | 58.6 | 81.3 | 88.0 | InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks |
| ALIGN | 58.6 | 83.0 | 89.7 | 45.6 | 69.8 | 78.6 | Scaling Up Visual and Vision-Language Representation Learning With Noisy Text Supervision |
| Florence | 64.7 | 85.9 | - | 47.2 | 71.4 | - | Florence: A New Foundation Model for Computer Vision |
| M2-Encoder | 72.8 | 92.3 | 96.3 | 56.5 | 81.6 | 88.8 | M2-Encoder: Advancing Bilingual Image-Text Understanding by Large-scale Efficient Pretraining |
| RO-ViT | 68.9 | 87.8 | 92.2 | 51.8 | 75.0 | 83.0 | Region-Aware Pretraining for Open-Vocabulary Object Detection with Vision Transformers |
| TCL | 71.4 | 90.8 | 95.4 | 53.5 | 79.0 | 87.1 | Vision-Language Pre-Training with Triple Contrastive Learning |