Image Captioning On Nocaps Val Overall
Metrics
CIDEr: consensus-based n-gram similarity between the generated caption and the reference captions
Pretrain (#images): number of images used for vision-language pre-training
SPICE: semantic similarity to the references, computed by matching scene-graph tuples
Results
Performance results of various models on this benchmark
| Model Name | CIDEr | Pretrain (#images) | SPICE | Paper Title |
|---|---|---|---|---|
| BLIP-2 ViT-G FlanT5 XL (zero-shot) | 121.6 | 1.1B | 15.8 | BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models |
| BLIP-2 ViT-G OPT 6.7B (zero-shot) | 121.0 | 1.1B | 15.3 | BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models |
| BLIP-2 ViT-G OPT 2.7B (zero-shot) | 119.7 | 1.1B | 15.4 | BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models |
| LEMON_large | 113.4 | 200M | 15.0 | Scaling Up Vision-Language Pre-training for Image Captioning |
| BLIP_ViT-L | 113.2 | 129M | 14.8 | BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation |
| SimVLM | 112.2 | 1.8B | - | SimVLM: Simple Visual Language Model Pretraining with Weak Supervision |
| BLIP_CapFilt-L | 109.6 | 129M | 14.7 | BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation |
| OmniVL | 107.5 | 14M | 14.7 | OmniVL: One Foundation Model for Image-Language and Video-Language Tasks |
| VinVL | 95.5 | 5.7M | 13.5 | VinVL: Revisiting Visual Representations in Vision-Language Models |
| Enc-Dec | 90.2 | - | 12.1 | Conceptual 12M: Pushing Web-Scale Image-Text Pre-Training To Recognize Long-Tail Visual Concepts |
| OSCAR | 80.9 | 345M | 11.3 | Oscar: Object-Semantics Aligned Pre-training for Vision-Language Tasks |
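The CIDEr and SPICE numbers above come from the standard COCO-style caption evaluation toolchain applied to the nocaps validation set; the official scores are produced by the nocaps evaluation server, not locally. As a rough illustration of how a CIDEr score is computed against a set of reference captions, here is a minimal sketch using the pycocoevalcap package. The image IDs and captions are invented placeholders, and a real evaluation would cover the full corpus (nocaps provides roughly ten references per image), since CIDEr's IDF statistics are only meaningful over many images.

```python
# Minimal sketch: computing CIDEr with pycocoevalcap (pip install pycocoevalcap).
# IDs and captions below are made-up placeholders, not nocaps data.
from pycocoevalcap.cider.cider import Cider

# Ground-truth references: image_id -> list of reference captions.
references = {
    "img_0": ["a dog runs across a grassy field",
              "a brown dog running on green grass"],
    "img_1": ["a red bicycle leaning against a brick wall",
              "a bike parked next to a wall"],
}

# Model outputs: image_id -> single-element list with the generated caption.
candidates = {
    "img_0": ["a dog running through the grass"],
    "img_1": ["a red bike leaning on a wall"],
}

scorer = Cider()
corpus_score, per_image_scores = scorer.compute_score(references, candidates)
# Leaderboards conventionally report CIDEr scaled by 100 (e.g. 1.216 -> 121.6).
print(f"CIDEr: {corpus_score * 100:.1f}")
```

SPICE follows the same references/candidates interface in pycocoevalcap (pycocoevalcap.spice.spice.Spice), but it parses captions into scene graphs with a bundled Java tool and therefore needs a Java runtime installed.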