Image Captioning on COCO Captions

Evaluation Metrics

BLEU-1

BLEU-4

CIDEr

METEOR

ROUGE-L

SPICE

Evaluation Results

Performance of each model on this benchmark

Model	BLEU-1	BLEU-4	CIDEr	METEOR	ROUGE-L	SPICE	Paper Title	Repository
Meshed-Memory Transformer	80.8	39.1	131.2	29.2	58.6	22.6	Meshed-Memory Transformer for Image Captioning
mPLUG	-	46.5	155.1	32.0	-	26.0	mPLUG: Effective and Efficient Vision-Language Learning by Cross-modal Skip-connections
LaDiC	-	-	-	-	58.7	22.4	LaDiC: Are Diffusion Models Really Inferior to Autoregressive Counterparts for Image-to-Text Generation?
Xmodal-Ctx + OSCAR	-	41.3	142.2	-	-	24.9	Beyond a Pre-Trained Object Detector: Cross-Modal Textual and Visual Context for Image Captioning
RDN	80.2	37.3	125.2	28.1	57.4	-	Reflective Decoding Network for Image Captioning	-
SimVLM	-	40.6	143.3	33.4	-	25.4	SimVLM: Simple Visual Language Model Pretraining with Weak Supervision
RefineCap (w/ REINFORCE)	80.2	37.8	127.2	28.3	58.0	22.5	RefineCap: Concept-Aware Refinement for Image Captioning	-
X-Transformer	80.9	39.7	132.8	29.5	59.1	23.4	X-Linear Attention Networks for Image Captioning
CoCa	-	40.9	143.6	33.9	-	24.7	CoCa: Contrastive Captioners are Image-Text Foundation Models
PTP-BLIP (14M)	-	40.1	135.0	30.4	-	23.7	Position-guided Text Prompt for Vision-Language Pre-training
CLIP Text Encoder (RL w/ CIDEr-reward)	-	38.2	124.9	28.7	58.5	-	Fine-grained Image Captioning with CLIP Reward
AoANet + VC	-	39.5	-	29.3	59.3	-	Visual Commonsense R-CNN
X-VLM (base)	-	41.3	140.8	-	-	-	Multi-Grained Vision Language Pre-Training: Aligning Texts with Visual Concepts
From Captions to Visual Concepts and Back	-	25.7	-	23.6	-	-	From Captions to Visual Concepts and Back
LEMON	-	42.6	145.5	31.4	-	25.5	Scaling Up Vision-Language Pre-training for Image Captioning	-
KOSMOS-1 (1.6B) (zero-shot)	-	-	84.7	-	-	16.8	-	-
ClipCap (Transformer)	-	33.53	113.08	27.45	-	21.05	ClipCap: CLIP Prefix for Image Captioning
ClipCap (MLP + GPT2 tuning)	-	32.15	108.35	27.1	-	20.12	ClipCap: CLIP Prefix for Image Captioning
BLIP-2 ViT-G FlanT5 XL (zero-shot)	-	42.4	144.5	-	-	-	BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models
GRIT (No VL pretraining - base)	84.2	42.4	144.2	30.6	60.7	24.3	GRIT: Faster and Better Image captioning Transformer Using Dual Visual Features
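Because many cells in the table are "-" (score not reported), comparing models on a single metric takes a little care. The following is a minimal sketch, not part of the benchmark itself, of how the tab-separated rows could be parsed and ranked by one metric. The helper names (`parse_score`, `parse_rows`, `rank_by`) and the metric keys are assumptions made for illustration; only four rows from the table are included to keep the example short.

```python
from typing import Dict, List, Optional

# A few rows copied from the table above (tab-separated; "-" marks a missing score).
ROWS = """\
Meshed-Memory Transformer\t80.8\t39.1\t131.2\t29.2\t58.6\t22.6
mPLUG\t-\t46.5\t155.1\t32.0\t-\t26.0
LaDiC\t-\t-\t-\t-\t58.7\t22.4
GRIT (No VL pretraining - base)\t84.2\t42.4\t144.2\t30.6\t60.7\t24.3
"""

# Column order of the score fields, following the table header.
METRICS = ["BLEU-1", "BLEU-4", "CIDEr", "METEOR", "ROUGE-L", "SPICE"]


def parse_score(cell: str) -> Optional[float]:
    """Return the score as a float, or None when the cell is empty or "-"."""
    cell = cell.strip()
    return None if cell in ("", "-") else float(cell)


def parse_rows(text: str) -> List[Dict[str, object]]:
    """Turn tab-separated leaderboard rows into dicts keyed by metric name."""
    parsed = []
    for line in text.splitlines():
        if not line.strip():
            continue
        fields = line.split("\t")
        row: Dict[str, object] = {"model": fields[0]}
        for name, cell in zip(METRICS, fields[1:1 + len(METRICS)]):
            row[name] = parse_score(cell)
        parsed.append(row)
    return parsed


def rank_by(rows: List[Dict[str, object]], metric: str) -> List[Dict[str, object]]:
    """Sort rows by one metric, highest first, skipping rows without that score."""
    scored = [r for r in rows if r.get(metric) is not None]
    return sorted(scored, key=lambda r: r[metric], reverse=True)


if __name__ == "__main__":
    for row in rank_by(parse_rows(ROWS), "CIDEr"):
        print(f'{row["model"]}: {row["CIDEr"]}')
```

Rows without a reported value for the chosen metric (LaDiC for CIDEr in this sample) are dropped from the ranking rather than treated as zero, so a missing score is not mistaken for a poor one.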
