HyperAI
Image Captioning
Image Captioning on nocaps val (near-domain)
Metrics
CIDEr
Pre-train (#images)
SPICE

Results
Performance results of various models on this benchmark
Model name | CIDEr | Pre-train (#images) | SPICE | Paper Title
BLIP-2 ViT-G OPT 6.7B (zero-shot) | 119.2 | 1.1B | 15.3 | BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models
OmniVL | 108.3 | 14M | 14.9 | OmniVL: One Foundation Model for Image-Language and Video-Language Tasks
VinVL | 96.1 | 5.7M | 13.8 | VinVL: Revisiting Visual Representations in Vision-Language Models
BLIP-2 ViT-G FlanT5 XL (zero-shot) | 120.2 | 1.1B | 15.9 | BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models
Enc-Dec | 88.3 | - | 12.1 | Conceptual 12M: Pushing Web-Scale Image-Text Pre-Training To Recognize Long-Tail Visual Concepts
BLIP_ViT-L | 112.1 | 129M | 14.9 | BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation
LEMON_large | 113.3 | 200M | 15.1 | Scaling Up Vision-Language Pre-training for Image Captioning
SimVLM | 110.9 | 1.8B | - | SimVLM: Simple Visual Language Model Pretraining with Weak Supervision
BLIP_CapFilt-L | 108.6 | 129M | 14.8 | BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation
BLIP-2 ViT-G OPT 2.7B (zero-shot) | 117.8 | 1.1B | 15.4 | BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models
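The CIDEr column above measures TF-IDF-weighted n-gram agreement between a generated caption and a set of human reference captions. Below is a minimal, self-contained sketch of that idea in pure Python. It is not the official scorer: leaderboard numbers come from the CIDEr-D variant in the COCO caption evaluation toolkit, which adds count clipping and a length penalty, and the whitespace tokenization here is a simplifying assumption.

```python
from collections import Counter
import math

def ngrams(tokens, n):
    """All contiguous n-grams of a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def cider_score(candidate, refs, corpus_refs, n_max=4):
    """Simplified CIDEr: score one candidate caption against its references.

    corpus_refs holds the reference sets of every image in the corpus and is
    used only to estimate document frequencies for the IDF weights. Returns
    the cosine similarity of TF-IDF n-gram vectors (n = 1..n_max), averaged
    over references and over n, scaled by 10 as in the CIDEr paper.
    """
    num_images = len(corpus_refs)
    total = 0.0
    for n in range(1, n_max + 1):
        # Document frequency: in how many images' reference sets does
        # each n-gram appear at least once?
        df = Counter()
        for image_refs in corpus_refs:
            seen = set()
            for r in image_refs:
                seen.update(ngrams(r.split(), n))
            df.update(seen)

        def tfidf(sentence):
            tf = Counter(ngrams(sentence.split(), n))
            # N-grams unseen in the corpus are treated as df = 1.
            return {g: c * math.log(num_images / df.get(g, 1))
                    for g, c in tf.items()}

        def cosine(u, v):
            dot = sum(w * v.get(g, 0.0) for g, w in u.items())
            nu = math.sqrt(sum(w * w for w in u.values()))
            nv = math.sqrt(sum(w * w for w in v.values()))
            return dot / (nu * nv) if nu > 0 and nv > 0 else 0.0

        cand_vec = tfidf(candidate)
        total += sum(cosine(cand_vec, tfidf(r)) for r in refs) / len(refs)
    return 10.0 * total / n_max
```

Because IDF down-weights n-grams that occur in every image's references, generic words like "a" or "the" contribute nothing, while content words that match the references drive the score up.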