# Image Captioning on NoCaps Val (Near-Domain)
## Metrics

- CIDEr (Consensus-based Image Description Evaluation)
- Pre-train (#images): number of images seen during pre-training
- SPICE (Semantic Propositional Image Caption Evaluation)
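Both caption metrics have canonical implementations in the `pycocoevalcap` package. The sketch below is a minimal scoring example, assuming `pycocoevalcap` is installed (SPICE additionally requires a Java runtime); the image ID and caption strings are illustrative placeholders, not nocaps data.

```python
# Minimal sketch of caption scoring with pycocoevalcap (pip install pycocoevalcap).
from pycocoevalcap.cider.cider import Cider
from pycocoevalcap.spice.spice import Spice  # SPICE needs a Java runtime on PATH

# References: image_id -> list of ground-truth captions (nocaps provides
# multiple references per image). Placeholder data for illustration only.
gts = {
    "img_0": ["a dog runs across a grassy field", "a brown dog running on grass"],
}
# Candidates: image_id -> list holding the single generated caption.
res = {
    "img_0": ["a dog running through the grass"],
}

# For leaderboard-faithful numbers, captions are usually normalized with
# pycocoevalcap's PTBTokenizer first; raw strings are used here for brevity.
cider_score, _ = Cider().compute_score(gts, res)
spice_score, _ = Spice().compute_score(gts, res)
print(f"CIDEr: {cider_score:.3f}  SPICE: {spice_score:.3f}")
```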
## Results

Performance results of various models on this benchmark.
| Model | CIDEr | Pre-train (#images) | SPICE | Paper |
| --- | --- | --- | --- | --- |
| BLIP-2 ViT-G OPT 6.7B (zero-shot) | 119.2 | 1.1B | 15.3 | BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models |
| OmniVL | 108.3 | 14M | 14.9 | OmniVL: One Foundation Model for Image-Language and Video-Language Tasks |
| VinVL | 96.1 | 5.7M | 13.8 | VinVL: Revisiting Visual Representations in Vision-Language Models |
| BLIP-2 ViT-G FlanT5 XL (zero-shot) | 120.2 | 1.1B | 15.9 | BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models |
| Enc-Dec | 88.3 | - | 12.1 | Conceptual 12M: Pushing Web-Scale Image-Text Pre-Training To Recognize Long-Tail Visual Concepts |
| BLIP_ViT-L | 112.1 | 129M | 14.9 | BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation |
| LEMON_large | 113.3 | 200M | 15.1 | Scaling Up Vision-Language Pre-training for Image Captioning |
| SimVLM | 110.9 | 1.8B | - | SimVLM: Simple Visual Language Model Pretraining with Weak Supervision |
| BLIP_CapFilt-L | 108.6 | 129M | 14.8 | BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation |
| BLIP-2 ViT-G OPT 2.7B (zero-shot) | 117.8 | 1.1B | 15.4 | BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models |
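Several of the top entries are zero-shot BLIP-2 configurations. As a usage sketch, the code below captions a single image with the publicly released `Salesforce/blip2-opt-2.7b` checkpoint on the Hugging Face Hub, a checkpoint in the same family as the BLIP-2 OPT 2.7B entry above; the image path is a placeholder, and a full nocaps evaluation would additionally require the benchmark's reference annotations.

```python
# Zero-shot image captioning with BLIP-2 via Hugging Face transformers.
import torch
from PIL import Image
from transformers import Blip2Processor, Blip2ForConditionalGeneration

device = "cuda" if torch.cuda.is_available() else "cpu"
dtype = torch.float16 if device == "cuda" else torch.float32

processor = Blip2Processor.from_pretrained("Salesforce/blip2-opt-2.7b")
model = Blip2ForConditionalGeneration.from_pretrained(
    "Salesforce/blip2-opt-2.7b", torch_dtype=dtype
).to(device)

# Zero-shot captioning: no text prompt, the model generates a caption directly.
image = Image.open("example.jpg").convert("RGB")  # placeholder image path
inputs = processor(images=image, return_tensors="pt").to(device, dtype)
generated_ids = model.generate(**inputs, max_new_tokens=30)
caption = processor.batch_decode(generated_ids, skip_special_tokens=True)[0].strip()
print(caption)
```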