Image Captioning On Nocaps Val Near Domain

Metrics

CIDEr

Pre-train (#images)

SPICE

Results

Performance results of different models on this benchmark

Model	CIDEr	Pre-train (#images)	SPICE	Paper Title
BLIP-2 ViT-G FlanT5 XL (zero-shot)	120.2	1.1B	15.9	BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models
BLIP-2 ViT-G OPT 6.7B (zero-shot)	119.2	1.1B	15.3	BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models
BLIP-2 ViT-G OPT 2.7B (zero-shot)	117.8	1.1B	15.4	BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models
LEMON_large	113.3	200M	15.1	Scaling Up Vision-Language Pre-training for Image Captioning
BLIP_ViT-L	112.1	129M	14.9	BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation
SimVLM	110.9	1.8B	-	SimVLM: Simple Visual Language Model Pretraining with Weak Supervision
BLIP_CapFilt-L	108.6	129M	14.8	BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation
OmniVL	108.3	14M	14.9	OmniVL: One Foundation Model for Image-Language and Video-Language Tasks
VinVL	96.1	5.7M	13.8	VinVL: Revisiting Visual Representations in Vision-Language Models
Enc-Dec	88.3	-	12.1	Conceptual 12M: Pushing Web-Scale Image-Text Pre-Training To Recognize Long-Tail Visual Concepts

