Meshed-Memory Transformer | 80.8 | 39.1 | 131.2 | 29.2 | 58.6 | 22.6 | Meshed-Memory Transformer for Image Captioning | |
RefineCap (w/ REINFORCE) | 80.2 | 37.8 | 127.2 | 28.3 | 58.0 | 22.5 | RefineCap: Concept-Aware Refinement for Image Captioning | - |
X-Transformer | 80.9 | 39.7 | 132.8 | 29.5 | 59.1 | 23.4 | X-Linear Attention Networks for Image Captioning | |
CLIP Text Encoder (RL w/ CIDEr-reward) | - | 38.2 | 124.9 | 28.7 | 58.5 | - | Fine-grained Image Captioning with CLIP Reward | |
From Captions to Visual Concepts and Back | - | 25.7 | - | 23.6 | - | - | From Captions to Visual Concepts and Back | |
KOSMOS-1 (1.6B) (zero-shot) | - | - | 84.7 | - | - | 16.8 | - | - |
ClipCap (Transformer) | - | 33.53 | 113.08 | 27.45 | - | 21.05 | ClipCap: CLIP Prefix for Image Captioning | |
ClipCap (MLP + GPT2 tuning) | - | 32.15 | 108.35 | 27.1 | - | 20.12 | ClipCap: CLIP Prefix for Image Captioning | |
GRIT (No VL pretraining - base) | 84.2 | 42.4 | 144.2 | 30.6 | 60.7 | 24.3 | GRIT: Faster and Better Image captioning Transformer Using Dual Visual Features | |