VSE++ (ResNet) | 52.9 | 87.2 | 80.5 | 39.6 | 79.5 | 70.1 | VSE++: Improving Visual-Semantic Embeddings with Hard Negatives | - |
Dual-Path (ResNet) | - | 89.5 | - | 39.1 | 80.9 | 69.2 | Dual-Path Convolutional Image-Text Embeddings with Instance Loss | - |
VSE-Gradient | 97.0 | 100 | 99.6 | 86.3 | 99.0 | 97.4 | Dissecting Deep Metric Learning Losses for Image-Text Retrieval | - |
SCAN | 67.4 | 95.8 | 90.3 | 48.6 | 85.2 | 77.7 | Stacked Cross Attention for Image-Text Matching | - |
CMPL (ResNet) | 49.6 | 86.1 | 76.8 | 37.3 | 75.5 | 65.7 | Deep Cross-Modal Projection Learning for Image-Text Matching | - |
GSMN | 76.4 | 97.3 | 94.3 | 57.4 | 89.0 | 82.3 | Graph Structured Network for Image-Text Matching | - |