HyperAI초신경

Zero-Shot Cross-Modal Retrieval on COCO 2014

Evaluation Metrics

Image-to-text R@1
Image-to-text R@10
Image-to-text R@5
Text-to-image R@1
Text-to-image R@10
Text-to-image R@5
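
The R@K metrics listed above are usually read as Recall@K: the percentage of queries for which at least one correct match appears among the top K retrieved candidates. The following is a minimal sketch of that computation, not the evaluation code of any listed paper. The function name `recall_at_k`, the toy similarity scores, and the ground-truth sets are all hypothetical and chosen for illustration only.

```python
def recall_at_k(similarity, relevant, k):
    """Percentage of queries with at least one relevant candidate in the top k.

    similarity[q][c] is the score between query q and candidate c.
    relevant[q] is the set of candidate indices that count as correct for q.
    """
    hits = 0
    for scores, truth in zip(similarity, relevant):
        # Rank candidate indices from highest to lowest score.
        ranked = sorted(range(len(scores)), key=lambda c: scores[c], reverse=True)
        # The query counts as a hit if any correct candidate is in the top k.
        if truth & set(ranked[:k]):
            hits += 1
    return 100.0 * hits / len(similarity)


# Toy data (assumed): 3 image queries scored against 4 candidate captions.
toy_similarity = [
    [0.9, 0.1, 0.3, 0.2],
    [0.2, 0.4, 0.8, 0.1],
    [0.3, 0.2, 0.5, 0.4],
]
toy_relevant = [{0}, {1}, {3}]

r1 = recall_at_k(toy_similarity, toy_relevant, k=1)
r2 = recall_at_k(toy_similarity, toy_relevant, k=2)
print(f"R@1 = {r1:.1f}, R@2 = {r2:.1f}")
```

For image-to-text retrieval the queries are images and the candidates are captions, so `relevant[q]` may hold several caption indices per image. For text-to-image retrieval the roles are swapped and each caption typically has a single correct image.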

Evaluation Results

Performance results of each model on this benchmark

| Model Name | Image-to-text R@1 | Image-to-text R@10 | Image-to-text R@5 | Text-to-image R@1 | Text-to-image R@10 | Text-to-image R@5 | Paper Title | Repository |
|---|---|---|---|---|---|---|---|---|
| CoCa | 66.3 | 91.8 | 86.2 | 51.2 | 82.0 | 74.2 | CoCa: Contrastive Captioners are Image-Text Foundation Models | |
| CLIP | 58.4 | 88.1 | 81.5 | 37.8 | 72.2 | 62.4 | Learning Transferable Visual Models From Natural Language Supervision | |
| ViLT-B/32 | 56.5 | 89.6 | 82.6 | 40.4 | 81.1 | 70 | ViLT: Vision-and-Language Transformer Without Convolution or Region Supervision | |
| ERNIE-ViL 2.0 | 63.1 | 91.4 | 85.7 | 46.0 | 80.4 | 71.4 | ERNIE-ViL 2.0: Multi-view Contrastive Learning for Image-Text Pre-training | |
| ALBEF | 68.7 | 94.7 | 89.5 | 50.1 | 84.5 | 76.4 | Align before Fuse: Vision and Language Representation Learning with Momentum Distillation | |
| COSMOS ViT-B/32 | 64.3 | 92.0 | 86.5 | 48.4 | 82.6 | 74.2 | COSMOS: Cross-Modality Self-Distillation for Vision Language Pre-training | |
| PTP-BLIP | 69.7 | 94.7 | 90.0 | 49.5 | 84.2 | 75.9 | Position-guided Text Prompt for Vision-Language Pre-training | |
| COSMOS ViT-B/16 | 68.0 | 92.5 | 87.8 | 52.5 | 84.9 | 77.2 | COSMOS: Cross-Modality Self-Distillation for Vision Language Pre-training | |
| ImageBERT | 44.0 | 80.4 | 71.2 | 32.3 | 70.2 | 59.0 | ImageBERT: Cross-modal Pre-training with Large-scale Weak-supervised Image-Text Data | - |
| InternVL-C | 70.6 | 93.5 | 89.0 | 54.1 | 84.6 | 77.3 | InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks | |
| Flamingo | 65.9 | 92.9 | 87.3 | 48.0 | 82.1 | 73.3 | Flamingo: a Visual Language Model for Few-Shot Learning | |
| InternVL-G | 74.9 | 95.2 | 91.3 | 58.6 | 88.0 | 81.3 | InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks | |
| ALIGN | 58.6 | 89.7 | 83.0 | 45.6 | 78.6 | 69.8 | Scaling Up Visual and Vision-Language Representation Learning With Noisy Text Supervision | |
| dfdf | 0 | 0 | 0 | 0 | 0 | 0 | - | - |
| Florence | 64.7 | - | 85.9 | 47.2 | - | 71.4 | Florence: A New Foundation Model for Computer Vision | |
| M2-Encoder | 72.8 | 96.3 | 92.3 | 56.5 | 88.8 | 81.6 | M2-Encoder: Advancing Bilingual Image-Text Understanding by Large-scale Efficient Pretraining | |
| RO-ViT | 68.9 | 92.2 | 87.8 | 51.8 | 83.0 | 75.0 | Region-Aware Pretraining for Open-Vocabulary Object Detection with Vision Transformers | |
| TCL | 71.4 | 95.4 | 90.8 | 53.5 | 87.1 | 79.0 | Vision-Language Pre-Training with Triple Contrastive Learning | |