
Cross-Modal Retrieval on Flickr30K

Metrics

Image-to-text R@1
Image-to-text R@5
Image-to-text R@10
Text-to-image R@1
Text-to-image R@5
Text-to-image R@10
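All six metrics are Recall@K (R@K): the percentage of queries for which a correct match appears among the top K retrieved items, reported for both retrieval directions. On the standard Flickr30K test split (1,000 images, five captions per image), image-to-text retrieval ranks all 5,000 captions for each image query, and text-to-image retrieval ranks all 1,000 images for each caption query. Below is a minimal sketch of the metric, assuming a precomputed similarity matrix whose caption columns 5i..5i+4 are the ground truth for image i; the `recall_at_k` helper is illustrative, not code from any listed paper.

```python
import numpy as np

def recall_at_k(sim, ks=(1, 5, 10)):
    """Compute image-to-text and text-to-image Recall@K.

    sim: (N, 5N) similarity matrix for N images and 5N captions,
         where captions 5i..5i+4 are the ground truth for image i.
    Returns two dicts mapping K -> recall in percent.
    """
    n_img, n_txt = sim.shape
    assert n_txt == 5 * n_img, "expects five captions per image"

    # Image-to-text: rank all captions for each image query.
    i2t = {}
    order = np.argsort(-sim, axis=1)            # caption indices, best first
    for k in ks:
        topk = order[:, :k]                     # (N, k) retrieved captions
        hits = [(topk[i] // 5 == i).any() for i in range(n_img)]
        i2t[k] = 100.0 * np.mean(hits)

    # Text-to-image: rank all images for each caption query.
    t2i = {}
    order = np.argsort(-sim.T, axis=1)          # image indices, best first
    gt = np.arange(n_txt) // 5                  # ground-truth image per caption
    for k in ks:
        topk = order[:, :k]
        t2i[k] = 100.0 * np.mean([(topk[j] == gt[j]).any() for j in range(n_txt)])

    return i2t, t2i

# Example with random scores (real use: sim = image_emb @ text_emb.T).
rng = np.random.default_rng(0)
i2t, t2i = recall_at_k(rng.standard_normal((1000, 5000)))
print(i2t, t2i)
```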

Results

Performance results of various models on this benchmark

| Model | Image-to-text R@1 | Image-to-text R@5 | Image-to-text R@10 | Text-to-image R@1 | Text-to-image R@5 | Text-to-image R@10 | Paper |
|---|---|---|---|---|---|---|---|
| VSE++ (ResNet) | 52.9 | 80.5 | 87.2 | 39.6 | 70.1 | 79.5 | VSE++: Improving Visual-Semantic Embeddings with Hard Negatives |
| Dual-Path (ResNet) | - | - | 89.5 | 39.1 | 69.2 | 80.9 | Dual-Path Convolutional Image-Text Embeddings with Instance Loss |
| VSE-Gradient | 97.0 | 99.6 | 100 | 86.3 | 97.4 | 99.0 | Dissecting Deep Metric Learning Losses for Image-Text Retrieval |
| SGRAF | 77.8 | 94.1 | 97.4 | 58.5 | 83.0 | 88.8 | Similarity Reasoning and Filtration for Image-Text Matching |
| NAPReg | 79.6 | - | - | 60.0 | - | - | NAPReg: Nouns As Proxies Regularization for Semantically Aware Cross-Modal Embeddings |
| BEiT-3 | 98.0 | 100.0 | 100.0 | 90.3 | 98.7 | 99.5 | Image as a Foreign Language: BEiT Pretraining for All Vision and Vision-Language Tasks |
| SCAN | 67.4 | 90.3 | 95.8 | 48.6 | 77.7 | 85.2 | Stacked Cross Attention for Image-Text Matching |
| DSMD | 82.5 | 95.5 | 97.7 | 68.4 | 90.8 | 94.4 | Dynamic Self-adaptive Multiscale Distillation from Pre-trained Multimodal Large Model for Efficient Cross-modal Representation Learning |
| IAIS | 88.3 | 98.4 | 99.4 | 76.86 | 93.3 | 95.72 | Learning Relation Alignment for Calibrated Cross-modal Retrieval |
| X-VLM (base) | 97.1 | 100.0 | 100.0 | 86.9 | 97.3 | 98.7 | Multi-Grained Vision Language Pre-Training: Aligning Texts with Visual Concepts |
| 3SHNet | 87.1 | 98.2 | 99.2 | 69.5 | 91.0 | 94.7 | 3SHNet: Boosting Image-Sentence Retrieval via Visual Semantic-Spatial Self-Highlighting |
| CMPL (ResNet) | 49.6 | 76.8 | 86.1 | 37.3 | 65.7 | 75.5 | Deep Cross-Modal Projection Learning for Image-Text Matching |
| ERNIE-ViL 2.0 | 97.2 | 100.0 | 100.0 | 93.3 | 99.4 | 99.8 | ERNIE-ViL 2.0: Multi-view Contrastive Learning for Image-Text Pre-training |
| GSMN | 76.4 | 94.3 | 97.3 | 57.4 | 82.3 | 89.0 | Graph Structured Network for Image-Text Matching |
| SCO (ResNet) | 55.5 | 82.0 | 89.3 | 41.1 | 70.5 | 80.1 | Learning Semantic Concepts and Order for Image and Sentence Matching |
| Aurora (ours, r=128) | 97.2 | 100 | 100 | 86.8 | 97.6 | 98.9 | - |
| ViSTA | 89.5 | 98.4 | 99.6 | 75.8 | 94.2 | 96.9 | ViSTA: Vision and Scene Text Aggregation for Cross-Modal Retrieval |
| X2-VLM (base) | 98.5 | 100 | 100 | 90.4 | 98.2 | 99.3 | X$^2$-VLM: All-In-One Pre-trained Model For Vision-Language Tasks |
| IMRAM | 74.1 | 93.0 | 96.6 | 53.9 | 79.4 | 87.2 | IMRAM: Iterative Matching with Recurrent Attention Memory for Cross-Modal Image-Text Retrieval |
| ViLT-B/32 | 83.5 | 96.7 | 98.6 | 64.4 | 88.7 | 93.8 | ViLT: Vision-and-Language Transformer Without Convolution or Region Supervision |
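For context, numbers like these are produced by embedding every test image and caption with a model's joint encoder and scoring all pairs. Below is a hedged sketch of building that similarity matrix with a generic dual encoder via Hugging Face transformers; the `openai/clip-vit-base-patch32` checkpoint, the `similarity_matrix` helper, and the `test_image_paths`/`test_captions` variables are illustrative assumptions, and no row in the table was computed this way.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Illustrative checkpoint; none of the tabulated models is implied here.
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").eval()
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

@torch.no_grad()
def similarity_matrix(image_paths, captions, batch=64):
    """Embed images and captions, return an (N, 5N) cosine-similarity matrix."""
    img_embs, txt_embs = [], []
    for i in range(0, len(image_paths), batch):
        imgs = [Image.open(p).convert("RGB") for p in image_paths[i:i + batch]]
        inputs = processor(images=imgs, return_tensors="pt")
        emb = model.get_image_features(**inputs)
        img_embs.append(emb / emb.norm(dim=-1, keepdim=True))
    for i in range(0, len(captions), batch):
        inputs = processor(text=captions[i:i + batch], return_tensors="pt",
                           padding=True, truncation=True)
        emb = model.get_text_features(**inputs)
        txt_embs.append(emb / emb.norm(dim=-1, keepdim=True))
    return (torch.cat(img_embs) @ torch.cat(txt_embs).T).numpy()

# Usage with the recall_at_k helper sketched under Metrics, assuming
# captions 5i..5i+4 describe image i (hypothetical variable names):
# sim = similarity_matrix(test_image_paths, test_captions)
# i2t, t2i = recall_at_k(sim)
```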