
Zero-Shot Cross-Modal Retrieval on Flickr30K

Metrics

Image-to-text R@1
Image-to-text R@5
Image-to-text R@10
Text-to-image R@1
Text-to-image R@5
Text-to-image R@10
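
For context, Recall@K (R@K) is the fraction of queries for which a ground-truth match appears among the top-K retrieved candidates: image-to-text retrieval uses images as queries against the caption pool, and text-to-image retrieval uses captions as queries against the image pool. The sketch below is a minimal, illustrative computation of R@K from a query-candidate similarity matrix (e.g., cosine scores from a dual encoder); it is not taken from any listed paper. It assumes a simplified one-to-one pairing in which query i matches candidate i, whereas the standard Flickr30K protocol pairs each image with five captions, so image-to-text evaluation counts a hit if any of the five ground-truth captions is retrieved. The function name `recall_at_k` is our own.

```python
import numpy as np

def recall_at_k(similarity: np.ndarray, k: int) -> float:
    """Fraction of queries whose matching candidate (assumed to share the
    query's index) appears among the top-k retrieved candidates.

    similarity: (num_queries, num_candidates) score matrix, higher is better.
    """
    # Rank candidate indices for each query by descending similarity.
    top_k = np.argsort(-similarity, axis=1)[:, :k]
    # A query counts as a hit if its own index is in its top-k list.
    hits = (top_k == np.arange(similarity.shape[0])[:, None]).any(axis=1)
    return float(hits.mean())

# Toy 3x3 example: rows are image queries, columns are caption candidates.
sim = np.array([
    [0.9, 0.1, 0.2],
    [0.3, 0.8, 0.1],
    [0.2, 0.4, 0.7],
])
print(recall_at_k(sim, k=1))    # image-to-text R@1
print(recall_at_k(sim.T, k=1))  # text-to-image R@1
```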

Results

Performance results of different models on this benchmark

| Model | Image-to-text R@1 | Image-to-text R@5 | Image-to-text R@10 | Text-to-image R@1 | Text-to-image R@5 | Text-to-image R@10 | Paper Title |
|---|---|---|---|---|---|---|---|
| COSMOS ViT-B/32 | 89.9 | 98.8 | 99.3 | 76.1 | 92.8 | 96.2 | COSMOS: Cross-Modality Self-Distillation for Vision Language Pre-training |
| OpenCLIP ViT-H/14 | - | 99.3 | - | - | 94.1 | - | Reproducible scaling laws for contrastive language-image learning |
| ViLT-B/32 | 73.2 | 93.6 | 96.5 | 55.0 | 82.5 | 89.8 | ViLT: Vision-and-Language Transformer Without Convolution or Region Supervision |
| ERNIE-ViL 2.0 | 91.2 | 99.1 | 99.8 | 77.4 | 93.8 | 96.4 | ERNIE-ViL 2.0: Multi-view Contrastive Learning for Image-Text Pre-training |
| ALBEF | 90.5 | 98.8 | 99.7 | 76.8 | 93.7 | 96.7 | Align before Fuse: Vision and Language Representation Learning with Momentum Distillation |
| ALIGN | 88.6 | 98.7 | 99.7 | 75.7 | 93.8 | 96.8 | Scaling Up Visual and Vision-Language Representation Learning With Noisy Text Supervision |
| AltCLIP | 86.0 | 98.0 | 99.1 | 72.5 | 91.6 | 95.4 | AltCLIP: Altering the Language Encoder in CLIP for Extended Language Capabilities |
| InternVL-G | 95.7 | 99.7 | 99.9 | 85.0 | 97.0 | 98.6 | InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks |
| COSMOS ViT-B/16 | 92.9 | 99.4 | 99.9 | 80.3 | 95.3 | 97.6 | COSMOS: Cross-Modality Self-Distillation for Vision Language Pre-training |
| VK-OOD | 89.0 | 99.2 | 99.8 | 77.2 | 94.3 | 98.2 | Implicit Differentiable Outlier Detection Enable Robust Deep Multimodal Analysis |
| CoCa | 92.5 | 99.5 | 99.9 | 80.4 | 95.7 | 97.7 | CoCa: Contrastive Captioners are Image-Text Foundation Models |
| InternVL-C | 94.7 | 99.6 | 99.9 | 81.7 | 96.0 | 98.2 | InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks |
| BEiT-3 | 94.9 | 99.9 | 100.0 | 81.5 | 95.6 | 97.8 | Image as a Foreign Language: BEiT Pretraining for All Vision and Vision-Language Tasks |
| Flamingo | 89.3 | 98.8 | 99.7 | 79.5 | 95.3 | 97.9 | Flamingo: a Visual Language Model for Few-Shot Learning |
| PTP-BLIP (14M) | 87.1 | 98.4 | 99.3 | 73.1 | 91.0 | 94.8 | Position-guided Text Prompt for Vision-Language Pre-training |
| Florence | 90.9 | 99.1 | - | 76.7 | 93.6 | - | Florence: A New Foundation Model for Computer Vision |
| M2-Encoder | 91.2 | 99.2 | 99.6 | 92.2 | 99.5 | 99.7 | M2-Encoder: Advancing Bilingual Image-Text Understanding by Large-scale Efficient Pretraining |
| ImageBERT | 70.7 | 90.2 | 94.0 | 54.3 | 79.6 | 87.5 | ImageBERT: Cross-modal Pre-training with Large-scale Weak-supervised Image-Text Data |
| VAST | - | - | - | 90.4 | - | - | VAST: A Vision-Audio-Subtitle-Text Omni-Modality Foundation Model and Dataset |
| UNITER | 80.7 | 95.7 | 98.0 | 66.2 | 88.4 | 92.9 | UNITER: UNiversal Image-TExt Representation Learning |