Image Captioning on nocaps val (overall)
Metrics: CIDEr, SPICE. Each entry also reports pre-training scale as Pretrain (#images).
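CIDEr and SPICE are reference-based caption metrics. Below is a minimal sketch of how such scores might be computed locally, assuming the third-party pycocoevalcap package and hand-written toy captions; the official nocaps numbers come from the benchmark's own evaluation setup, not from this snippet.

```python
# Minimal sketch: scoring candidate captions against references with CIDEr,
# assuming the pycocoevalcap package (pip install pycocoevalcap).
# The image ids and captions below are made-up toy data; CIDEr's tf-idf
# weighting is corpus-level, so scores are only meaningful over a full
# evaluation set such as the nocaps val split.
from pycocoevalcap.cider.cider import Cider

# Both dicts map an image id to a list of tokenized, lower-cased captions.
references = {
    "img_0": ["a dog runs across a grassy field",
              "a brown dog running on the grass"],
    "img_1": ["a red bus parked on the street",
              "a city bus waiting at a stop"],
}
candidates = {
    "img_0": ["a dog running through a field"],
    "img_1": ["a red bus on a street"],
}

corpus_score, per_image_scores = Cider().compute_score(references, candidates)
print(f"CIDEr: {corpus_score:.3f}")

# SPICE exposes the same compute_score(references, candidates) interface via
# pycocoevalcap.spice.spice.Spice, but it shells out to a Java scorer, so a
# Java runtime must be installed.
```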
Results
Performance results of various models on this benchmark:
| Model Name | CIDEr | Pretrain (#images) | SPICE | Paper Title | Repository |
|---|---|---|---|---|---|
| BLIP_CapFilt-L | 109.6 | 129M | 14.7 | BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation | |
| BLIP_ViT-L | 113.2 | 129M | 14.8 | BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation | |
| BLIP-2 ViT-G FlanT5 XL (zero-shot) | 121.6 | 1.1B | 15.8 | BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models | |
| Enc-Dec | 90.2 | - | 12.1 | Conceptual 12M: Pushing Web-Scale Image-Text Pre-Training To Recognize Long-Tail Visual Concepts | |
| BLIP-2 ViT-G OPT 6.7B (zero-shot) | 121.0 | 1.1B | 15.3 | BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models | |
| OSCAR | 80.9 | 345M | 11.3 | Oscar: Object-Semantics Aligned Pre-training for Vision-Language Tasks | |
| VinVL | 95.5 | 5.7M | 13.5 | VinVL: Revisiting Visual Representations in Vision-Language Models | |
| BLIP-2 ViT-G OPT 2.7B (zero-shot) | 119.7 | 1.1B | 15.4 | BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models | |
| SimVLM | 112.2 | 1.8B | - | SimVLM: Simple Visual Language Model Pretraining with Weak Supervision | |
| OmniVL | 107.5 | 14M | 14.7 | OmniVL: One Foundation Model for Image-Language and Video-Language Tasks | - |
| LEMON_large | 113.4 | 200M | 15.0 | Scaling Up Vision-Language Pre-training for Image Captioning | - |
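Several of the listed models have public checkpoints, so captions can be generated locally. The sketch below assumes the Hugging Face transformers library and the Salesforce/blip-image-captioning-large checkpoint, which corresponds to the BLIP ViT-L / CapFilt-L line of models in the table, though not necessarily to the exact weights scored on nocaps.

```python
# Minimal sketch: zero-shot caption generation with a public BLIP checkpoint,
# assuming the transformers, torch, Pillow, and requests packages.
# "Salesforce/blip-image-captioning-large" is an assumed checkpoint name and
# may not match the exact weights behind the leaderboard entries above.
import requests
from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration

model_id = "Salesforce/blip-image-captioning-large"
processor = BlipProcessor.from_pretrained(model_id)
model = BlipForConditionalGeneration.from_pretrained(model_id)

# Any RGB image works; a COCO val2017 image URL is used here for illustration.
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw).convert("RGB")

inputs = processor(images=image, return_tensors="pt")
output_ids = model.generate(**inputs, max_new_tokens=30)
print(processor.decode(output_ids[0], skip_special_tokens=True))
```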