HyperAI超神经
Zero Shot Cross Modal Retrieval On Flickr30K
Evaluation Metrics
Image-to-text R@1
Image-to-text R@5
Image-to-text R@10
Text-to-image R@1
Text-to-image R@5
Text-to-image R@10
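The Recall@K metrics above measure, for each query, whether a ground-truth match appears among the top K retrieved items. A minimal sketch of the computation is below; the `recall_at_k` function and the toy similarity matrix are illustrative, and it assumes one matching text per image at the same index (a simplification: Flickr30K pairs each image with five captions, so real evaluations count a hit if any ground-truth caption ranks in the top K).

```python
import numpy as np

def recall_at_k(sim, k):
    """Recall@K for image-to-text retrieval.

    sim: (n_images, n_texts) similarity matrix. Assumes the matching
    text for image i is text i (one caption per image, for illustration).
    """
    # Indices of the top-k most similar texts for each image query
    topk = np.argsort(-sim, axis=1)[:, :k]
    # A query is a "hit" if its ground-truth index appears in the top k
    hits = (topk == np.arange(sim.shape[0])[:, None]).any(axis=1)
    return float(hits.mean())

# Toy example: 3 image queries against 3 texts
sim = np.array([[0.9, 0.1, 0.2],
                [0.2, 0.3, 0.8],
                [0.1, 0.7, 0.3]])
print(recall_at_k(sim, 1))  # only image 0 ranks its own text first
print(recall_at_k(sim, 2))  # all matches fall within the top 2
```

Text-to-image R@K is the same computation run on the transposed similarity matrix, with texts as queries.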
Evaluation Results
Performance of each model on this benchmark.
| Model | Image-to-text R@1 | Image-to-text R@5 | Image-to-text R@10 | Text-to-image R@1 | Text-to-image R@5 | Text-to-image R@10 | Paper Title |
|---|---|---|---|---|---|---|---|
| COSMOS ViT-B/32 | 89.9 | 98.8 | 99.3 | 76.1 | 92.8 | 96.2 | COSMOS: Cross-Modality Self-Distillation for Vision Language Pre-training |
| OpenCLIP ViT-H/14 | - | 99.3 | - | - | 94.1 | - | Reproducible scaling laws for contrastive language-image learning |
| ViLT-B/32 | 73.2 | 93.6 | 96.5 | 55.0 | 82.5 | 89.8 | ViLT: Vision-and-Language Transformer Without Convolution or Region Supervision |
| ERNIE-ViL 2.0 | 91.2 | 99.1 | 99.8 | 77.4 | 93.8 | 96.4 | ERNIE-ViL 2.0: Multi-view Contrastive Learning for Image-Text Pre-training |
| ALBEF | 90.5 | 98.8 | 99.7 | 76.8 | 93.7 | 96.7 | Align before Fuse: Vision and Language Representation Learning with Momentum Distillation |
| ALIGN | 88.6 | 98.7 | 99.7 | 75.7 | 93.8 | 96.8 | Scaling Up Visual and Vision-Language Representation Learning With Noisy Text Supervision |
| AltCLIP | 86.0 | 98.0 | 99.1 | 72.5 | 91.6 | 95.4 | AltCLIP: Altering the Language Encoder in CLIP for Extended Language Capabilities |
| InternVL-G | 95.7 | 99.7 | 99.9 | 85.0 | 97.0 | 98.6 | InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks |
| COSMOS ViT-B/16 | 92.9 | 99.4 | 99.9 | 80.3 | 95.3 | 97.6 | COSMOS: Cross-Modality Self-Distillation for Vision Language Pre-training |
| VK-OOD | 89.0 | 99.2 | 99.8 | 77.2 | 94.3 | 98.2 | Implicit Differentiable Outlier Detection Enable Robust Deep Multimodal Analysis |
| CoCa | 92.5 | 99.5 | 99.9 | 80.4 | 95.7 | 97.7 | CoCa: Contrastive Captioners are Image-Text Foundation Models |
| InternVL-C | 94.7 | 99.6 | 99.9 | 81.7 | 96.0 | 98.2 | InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks |
| BEiT-3 | 94.9 | 99.9 | 100.0 | 81.5 | 95.6 | 97.8 | Image as a Foreign Language: BEiT Pretraining for All Vision and Vision-Language Tasks |
| Flamingo | 89.3 | 98.8 | 99.7 | 79.5 | 95.3 | 97.9 | Flamingo: a Visual Language Model for Few-Shot Learning |
| PTP-BLIP (14M) | 87.1 | 98.4 | 99.3 | 73.1 | 91.0 | 94.8 | Position-guided Text Prompt for Vision-Language Pre-training |
| Florence | 90.9 | 99.1 | - | 76.7 | 93.6 | - | Florence: A New Foundation Model for Computer Vision |
| M2-Encoder | 91.2 | 99.2 | 99.6 | 92.2 | 99.5 | 99.7 | M2-Encoder: Advancing Bilingual Image-Text Understanding by Large-scale Efficient Pretraining |
| ImageBERT | 70.7 | 90.2 | 94.0 | 54.3 | 79.6 | 87.5 | ImageBERT: Cross-modal Pre-training with Large-scale Weak-supervised Image-Text Data |
| VAST | - | - | - | 90.4 | - | - | VAST: A Vision-Audio-Subtitle-Text Omni-Modality Foundation Model and Dataset |
| UNITER | 80.7 | 95.7 | 98.0 | 66.2 | 88.4 | 92.9 | UNITER: UNiversal Image-TExt Representation Learning |