HyperAI

Cross-Modal Retrieval on COCO 2014

Metrics

Image-to-text R@1
Image-to-text R@10
Image-to-text R@5
Text-to-image R@1
Text-to-image R@10
Text-to-image R@5
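The six metrics above are all Recall@K: the fraction of queries whose ground-truth match appears among the top-K retrieved items, reported in both directions (image→text and text→image). As a minimal sketch (assuming a one-to-one image–caption pairing, whereas the actual COCO evaluation has five captions per image), Recall@K can be computed from a query-by-candidate similarity matrix like this:

```python
import numpy as np

def recall_at_k(sim, k):
    """Recall@K: fraction of queries whose ground-truth match
    (assumed to lie on the diagonal) ranks in the top-K retrievals."""
    ranks = np.argsort(-sim, axis=1)  # candidate indices, descending similarity
    gt = np.arange(sim.shape[0]).reshape(-1, 1)  # ground-truth index per query
    return float(np.mean((ranks[:, :k] == gt).any(axis=1)))

# Toy similarity matrix: rows = image queries, columns = captions.
sim = np.array([[0.9, 0.1, 0.3],
                [0.2, 0.4, 0.8],   # caption 1 ranks 2nd for image 1
                [0.1, 0.2, 0.6]])

print(recall_at_k(sim, 1))  # 2 of 3 ground truths rank first -> 0.666...
print(recall_at_k(sim, 2))  # all ground truths within top-2  -> 1.0
```

Image-to-text scores use the image-by-caption similarity matrix as-is; text-to-image scores come from its transpose, which is why the two directions differ in the table below.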

Results

Performance results of various models on this benchmark

| Model | Image-to-text R@1 | Image-to-text R@10 | Image-to-text R@5 | Text-to-image R@1 | Text-to-image R@10 | Text-to-image R@5 | Paper |
|---|---|---|---|---|---|---|---|
| PVSE | 45.2 | 84.5 | 74.3 | 32.4 | 75.0 | 63.0 | Polysemous Visual-Semantic Embedding for Cross-Modal Retrieval |
| 3SHNet | 67.9 | 95.4 | 90.5 | 50.3 | 87.7 | 79.3 | 3SHNet: Boosting Image-Sentence Retrieval via Visual Semantic-Spatial Self-Highlighting |
| XFM (base) | 84.2 | 98.4 | 96.4 | 67.0 | 92.4 | 87.2 | Toward Building General Foundation Models for Language, Vision, and Vision-Language Understanding Tasks |
| METER | 76.16 | 96.82 | 93.16 | 57.08 | 90.07 | 82.66 | An Empirical Study of Training End-to-End Vision-and-Language Transformers |
| SGRAF | 57.8 | 91.6 | 84.9 | 41.9 | 81.3 | 70.7 | Similarity Reasoning and Filtration for Image-Text Matching |
| LILE | 55.6 | 91.0 | 82.4 | 41.5 | 82.2 | 72.1 | LILE: Look In-Depth before Looking Elsewhere -- A Dual Attention Network using Transformers for Cross-Modal Information Retrieval in Histopathology Archives |
| ViLT-B/32 | 61.5 | 92.7 | 86.3 | 42.7 | 83.1 | 72.9 | ViLT: Vision-and-Language Transformer Without Convolution or Region Supervision |
| PTP-BLIP (14M) | 81.5 | 97.9 | 95.9 | 64.9 | 92.2 | 87.4 | Position-guided Text Prompt for Vision-Language Pre-training |
| Dual-Path (ResNet) | 41.2 | 81.1 | 70.5 | 25.3 | 66.4 | 53.4 | Deep Visual-Semantic Alignments for Generating Image Descriptions |
| VSE-Gradient | 81.4 | 97.9 | 95.6 | 63.6 | 91.5 | 86.0 | Dissecting Deep Metric Learning Losses for Image-Text Retrieval |
| IMRAM | 53.7 | 91.0 | 83.2 | 39.7 | 79.8 | 69.1 | IMRAM: Iterative Matching with Recurrent Attention Memory for Cross-Modal Image-Text Retrieval |
| VSRN | 53.0 | 89.4 | 81.1 | 40.5 | 81.1 | 70.6 | Visual Semantic Reasoning for Image-Text Matching |
| Florence | 81.8 | - | 95.2 | 63.2 | - | 85.7 | Florence: A New Foundation Model for Computer Vision |
| DSMD | 48.0 | 84.5 | 75.6 | 62.1 | 92.0 | 85.9 | Dynamic Self-adaptive Multiscale Distillation from Pre-trained Multimodal Large Model for Efficient Cross-modal Representation Learning |
| ALBEF | 77.6 | 97.2 | 94.3 | 60.7 | 90.5 | 84.3 | Align before Fuse: Vision and Language Representation Learning with Momentum Distillation |
| X2-VLM (base) | 83.5 | 98.5 | 96.3 | 66.2 | 92.2 | 87.1 | X$^2$-VLM: All-In-One Pre-trained Model For Vision-Language Tasks |
| OmniVL (14M) | 82.1 | 98.1 | 95.9 | 64.8 | 91.6 | 86.1 | OmniVL: One Foundation Model for Image-Language and Video-Language Tasks |
| X-VLM (base) | 81.2 | 98.2 | 95.6 | 63.4 | 91.5 | 85.8 | Multi-Grained Vision Language Pre-Training: Aligning Texts with Visual Concepts |
| X2-VLM (large) | 84.4 | 98.5 | 96.5 | 67.7 | 92.5 | 87.5 | X$^2$-VLM: All-In-One Pre-trained Model For Vision-Language Tasks |
| ERNIE-ViL 2.0 | 77.4 | 97.1 | 93.6 | 59.5 | 90.1 | 83.4 | ERNIE-ViL 2.0: Multi-view Contrastive Learning for Image-Text Pre-training |