PTP-BLIP (14M) | 81.5 | 97.9 | 95.9 | 64.9 | 92.2 | 87.4 | Position-guided Text Prompt for Vision-Language Pre-training | |
Dual-Path (ResNet) | 41.2 | 81.1 | 70.5 | 25.3 | 66.4 | 53.4 | Dual-Path Convolutional Image-Text Embeddings with Instance Loss | |
VSE-Gradient | 81.4 | 97.9 | 95.6 | 63.6 | 91.5 | 86.0 | Dissecting Deep Metric Learning Losses for Image-Text Retrieval | |
X2-VLM (base) | 83.5 | 98.5 | 96.3 | 66.2 | 92.2 | 87.1 | X$^2$-VLM: All-In-One Pre-trained Model For Vision-Language Tasks | |
X2-VLM (large) | 84.4 | 98.5 | 96.5 | 67.7 | 92.5 | 87.5 | X$^2$-VLM: All-In-One Pre-trained Model For Vision-Language Tasks | |