
Image Captioning on COCO Captions

Evaluation Metrics

BLEU-1
BLEU-4
CIDEr
METEOR
ROUGE-L
SPICE

Evaluation Results

Performance of each model on this benchmark. A "-" marks a value that is not reported.

| Model | BLEU-1 | BLEU-4 | CIDEr | METEOR | ROUGE-L | SPICE | Paper Title | Repository |
|---|---|---|---|---|---|---|---|---|
| Meshed-Memory Transformer | 80.8 | 39.1 | 131.2 | 29.2 | 58.6 | 22.6 | Meshed-Memory Transformer for Image Captioning | |
| mPLUG | - | 46.5 | 155.1 | 32.0 | - | 26.0 | mPLUG: Effective and Efficient Vision-Language Learning by Cross-modal Skip-connections | |
| LaDiC | - | - | - | - | 58.7 | 22.4 | LaDiC: Are Diffusion Models Really Inferior to Autoregressive Counterparts for Image-to-Text Generation? | |
| Xmodal-Ctx + OSCAR | - | 41.3 | 142.2 | - | - | 24.9 | Beyond a Pre-Trained Object Detector: Cross-Modal Textual and Visual Context for Image Captioning | |
| RDN | 80.2 | 37.3 | 125.2 | 28.1 | 57.4 | - | Reflective Decoding Network for Image Captioning | - |
| SimVLM | - | 40.6 | 143.3 | 33.4 | - | 25.4 | SimVLM: Simple Visual Language Model Pretraining with Weak Supervision | |
| RefineCap (w/ REINFORCE) | 80.2 | 37.8 | 127.2 | 28.3 | 58.0 | 22.5 | RefineCap: Concept-Aware Refinement for Image Captioning | - |
| X-Transformer | 80.9 | 39.7 | 132.8 | 29.5 | 59.1 | 23.4 | X-Linear Attention Networks for Image Captioning | |
| CoCa | - | 40.9 | 143.6 | 33.9 | - | 24.7 | CoCa: Contrastive Captioners are Image-Text Foundation Models | |
| PTP-BLIP (14M) | - | 40.1 | 135.0 | 30.4 | - | 23.7 | Position-guided Text Prompt for Vision-Language Pre-training | |
| CLIP Text Encoder (RL w/ CIDEr-reward) | - | 38.2 | 124.9 | 28.7 | 58.5 | - | Fine-grained Image Captioning with CLIP Reward | |
| AoANet + VC | - | 39.5 | - | 29.3 | 59.3 | - | Visual Commonsense R-CNN | |
| X-VLM (base) | - | 41.3 | 140.8 | - | - | - | Multi-Grained Vision Language Pre-Training: Aligning Texts with Visual Concepts | |
| From Captions to Visual Concepts and Back | - | 25.7 | - | 23.6 | - | - | From Captions to Visual Concepts and Back | |
| LEMON | - | 42.6 | 145.5 | 31.4 | - | 25.5 | Scaling Up Vision-Language Pre-training for Image Captioning | - |
| KOSMOS-1 (1.6B) (zero-shot) | - | - | 84.7 | - | - | 16.8 | - | - |
| ClipCap (Transformer) | - | 33.53 | 113.08 | 27.45 | - | 21.05 | ClipCap: CLIP Prefix for Image Captioning | |
| ClipCap (MLP + GPT2 tuning) | - | 32.15 | 108.35 | 27.1 | - | 20.12 | ClipCap: CLIP Prefix for Image Captioning | |
| BLIP-2 ViT-G FlanT5 XL (zero-shot) | - | 42.4 | 144.5 | - | - | - | BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models | |
| GRIT (No VL pretraining - base) | 84.2 | 42.4 | 144.2 | 30.6 | 60.7 | 24.3 | GRIT: Faster and Better Image captioning Transformer Using Dual Visual Features | |
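
Because many entries in the results above are missing, comparing models on a single metric means skipping the rows that do not report it. The following is a minimal sketch of that, not part of the benchmark itself: the `ROWS` list copies the BLEU-4 and CIDEr values of six rows from the results above, and the names `ROWS` and `rank_by` are hypothetical helpers introduced only for illustration.

```python
from typing import Optional

# (model, BLEU-4, CIDEr) copied from six rows of the results above.
# None stands for a "-" entry, i.e. a score that is not reported.
ROWS: list[tuple[str, Optional[float], Optional[float]]] = [
    ("Meshed-Memory Transformer", 39.1, 131.2),
    ("mPLUG", 46.5, 155.1),
    ("LaDiC", None, None),
    ("AoANet + VC", 39.5, None),
    ("KOSMOS-1 (1.6B) (zero-shot)", None, 84.7),
    ("GRIT (No VL pretraining - base)", 42.4, 144.2),
]


def rank_by(rows, column):
    """Return (model, score) pairs sorted by score, highest first.

    `column` is the tuple index of the metric (1 = BLEU-4, 2 = CIDEr).
    Models that do not report the metric are left out rather than
    treated as zero.
    """
    scored = [(row[0], row[column]) for row in rows if row[column] is not None]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    for model, score in rank_by(ROWS, 2):
        print(f"{model}: {score}")
```

Leaving out unreported scores, instead of treating them as zero, avoids ranking a model last on a metric its paper never measured. Note also that the rows mix settings (for example zero-shot versus fine-tuned), so a ranking like this is only a rough comparison.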