HyperAI
Image-to-Text Retrieval on COCO
Metrics
Recall@1
Recall@10
Recall@5
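As a rough illustration of how a Recall@K score is commonly computed for image-to-text retrieval, the sketch below counts an image query as a hit when at least one of its ground-truth captions appears among the K highest-scoring captions, and reports the hit rate as a percentage. The function name `recall_at_k`, the similarity values, and the ground-truth sets are hypothetical toy data, not taken from any paper listed on this page, and individual papers may differ in evaluation details such as the test split used.

```python
def recall_at_k(similarity, relevant, k):
    """Percentage of image queries with at least one relevant caption in the top k.

    similarity: one list of caption scores per image (higher means more similar).
    relevant:   one set of ground-truth caption indices per image.
    """
    hits = 0
    for scores, rel in zip(similarity, relevant):
        # Rank caption indices from highest to lowest similarity score.
        ranked = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
        # The query counts as a hit if any ground-truth caption is in the top k.
        if any(j in rel for j in ranked[:k]):
            hits += 1
    return 100.0 * hits / len(similarity)


# Toy example (made-up numbers): 3 images scored against 6 captions.
similarity = [
    [0.9, 0.1, 0.3, 0.2, 0.0, 0.4],
    [0.2, 0.3, 0.8, 0.1, 0.7, 0.0],
    [0.1, 0.2, 0.3, 0.6, 0.5, 0.9],
]
relevant = [{0, 1}, {3}, {4, 5}]

for k in (1, 5):
    print(f"Recall@{k}: {recall_at_k(similarity, relevant, k):.2f}")
```

In this toy data the second image's only ground-truth caption is ranked fifth, so it is missed at K=1 and recovered at K=5.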
Results
Performance results of various models on this benchmark
| Model Name | Recall@1 | Recall@10 | Recall@5 | Paper Title |
| --- | --- | --- | --- | --- |
| Oscar | - | 99.8 | - | Oscar: Object-Semantics Aligned Pre-training for Vision-Language Tasks |
| BLIP-2 (ViT-G, fine-tuned) | 85.4 | 98.5 | 97.0 | BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models |
| ONE-PEACE (ViT-G, w/o ranking) | 84.1 | 98.3 | 96.3 | ONE-PEACE: Exploring One General Representation Model Toward Unlimited Modalities |
| BLIP-2 (ViT-L, fine-tuned) | 83.5 | 98.0 | 96.0 | BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models |
| Unicoder-VL | - | 97.2 | - | Unicoder-VL: A Universal Encoder for Vision and Language by Cross-modal Pre-training |
| IAIS | 67.78 | 94.48 | 89.7 | Learning Relation Alignment for Calibrated Cross-modal Retrieval |
| CLIP (zero-shot) | 58.4 | 88.1 | 81.5 | Learning Transferable Visual Models From Natural Language Supervision |
| DVSA | - | 74.8 | - | Deep Visual-Semantic Alignments for Generating Image Descriptions |
| SigLIP (ViT-L, zero-shot) | 70.6 | - | - | Sigmoid Loss for Language Image Pre-Training |
| FLAVA (ViT-B, zero-shot) | 42.74 | - | 76.76 | FLAVA: A Foundational Language And Vision Alignment Model |