Image Captioning On Nocaps In Domain
Metrics
B1
B2
B3
B4
CIDEr
METEOR
ROUGE-L
SPICE
Results
Performance results of various models on this benchmark
Model name | B1 | B2 | B3 | B4 | CIDEr | METEOR | ROUGE-L | SPICE | Paper Title | Repository |
---|---|---|---|---|---|---|---|---|---|---|
GIT2, Single Model | 88.86 | 75.86 | 59.94 | 41.1 | 124.18 | 33.83 | 63.82 | 16.36 | GIT: A Generative Image-to-text Transformer for Vision and Language | |
MD | 84.03 | 69.12 | 51.16 | 33.15 | 100.03 | 30.06 | 59.67 | 14.08 | - | - |
CoCa - Google Brain | 87.27 | 74.29 | 58.01 | 39.24 | 117.9 | 33.01 | 63.12 | 15.49 | - | - |
7_10-7_40000_predict_test.json | 75.31 | 56.79 | 37.85 | 21.91 | 73.73 | 26.02 | 52.44 | 12.04 | - | - |
IEDA-LAB | 84.4 | 69.8 | 51.89 | 32.86 | 102.64 | 30.43 | 60.07 | 14.47 | - | - |
PaLI | - | - | - | - | 149.1 | - | - | - | PaLI: A Jointly-Scaled Multilingual Language-Image Model | |
Xinyi | 81.61 | 63.74 | 43.22 | 24.82 | 84.79 | 27.27 | 55.03 | 12.3 | - | - |
evertyhing | 79.58 | 63.09 | 43.92 | 26.07 | 87.86 | 27.97 | 55.88 | 12.6 | - | - |
CS395T | 72.24 | 51.88 | 29.57 | 14.54 | 58.93 | 22.04 | 49.05 | 8.91 | - | - |
cxy_nocaps_training | 81.64 | 63.79 | 43.43 | 25.15 | 85.81 | 27.25 | 55.06 | 12.35 | - | - |
FudanWYZ | 82.91 | 68.02 | 50.75 | 33.59 | 104.25 | 31.33 | 59.67 | 14.85 | - | - |
GIT, Single Model | 88.55 | 76.1 | 60.53 | 41.65 | 122.4 | 33.41 | 64.02 | 16.18 | GIT: A Generative Image-to-text Transformer for Vision and Language | |
YX | 76.48 | 58.76 | 39.28 | 21.96 | 69.59 | 25.08 | 53.22 | 10.94 | - | - |
UpDown | 77.68 | 60.34 | 41.5 | 24.57 | 74.27 | 26.04 | 54.42 | 11.47 | - | - |
ViTCAP-CIDEr-136.7-ENC-DEC-ViTbfocal10-test-CBS | 82.9 | 68.09 | 49.73 | 31.24 | 96.63 | 29.37 | 58.62 | 13.61 | - | - |
Single Model | 84.64 | 70.0 | 52.96 | 34.66 | 108.98 | 31.97 | 61.01 | 14.6 | SimVLM: Simple Visual Language Model Pretraining with Weak Supervision | |
coco_all_19 | 72.76 | 53.52 | 34.13 | 19.45 | 64.37 | 23.47 | 50.53 | 10.11 | - | - |
MQ-UpDown-C | 78.73 | 61.63 | 42.35 | 25.94 | 80.19 | 27.25 | 55.25 | 12.38 | - | - |
Human | 76.89 | 57.3 | 37.78 | 21.49 | 80.61 | 28.53 | 53.47 | 14.99 | - | - |
GRIT (zero-shot, no VL pretraining, no CBS) | - | - | - | - | 105.9 | - | - | 13.6 | GRIT: Faster and Better Image captioning Transformer Using Dual Visual Features | |