Cross Modal Retrieval On Flickr30K
Evaluation Metrics

Image-to-text R@1 / R@5 / R@10
Text-to-image R@1 / R@5 / R@10

Recall@K (R@K) is the percentage of queries for which a ground-truth match appears among the top K retrieved candidates.
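As a concrete illustration, R@K can be computed from a cross-modal similarity matrix. The sketch below is a minimal example, not the benchmark's official evaluation code; the embeddings are synthetic placeholders, and it assumes a simplified one-to-one ground truth (on Flickr30K each image actually has five reference captions, and image-to-text retrieval counts a hit if any of the five ranks in the top K).

```python
import numpy as np

def recall_at_k(sim, ks=(1, 5, 10)):
    """Compute R@K given sim[i, j] = similarity of query i to candidate j.

    Simplifying assumption: the ground-truth candidate for query i is
    candidate i (one matching pair per index).
    """
    n = sim.shape[0]
    order = np.argsort(-sim, axis=1)                          # candidates ranked best-first per query
    ranks = (order == np.arange(n)[:, None]).argmax(axis=1)   # rank position of each ground truth
    return {f"R@{k}": 100.0 * float((ranks < k).mean()) for k in ks}

# Toy example with random, hypothetical embeddings.
rng = np.random.default_rng(0)
img = rng.normal(size=(100, 64))                    # 100 image embeddings
txt = rng.normal(size=(100, 64))                    # 100 paired text embeddings
img /= np.linalg.norm(img, axis=1, keepdims=True)   # L2-normalize so dot product = cosine
txt /= np.linalg.norm(txt, axis=1, keepdims=True)
sim = img @ txt.T                                   # cosine similarity matrix

print(recall_at_k(sim))    # image-to-text R@1 / R@5 / R@10
print(recall_at_k(sim.T))  # text-to-image R@1 / R@5 / R@10
```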
Evaluation Results

Performance of each model on this benchmark:
| Model | Image-to-text R@1 | Image-to-text R@5 | Image-to-text R@10 | Text-to-image R@1 | Text-to-image R@5 | Text-to-image R@10 | Paper Title |
|---|---|---|---|---|---|---|---|
| VSE++ (ResNet) | 52.9 | 80.5 | 87.2 | 39.6 | 70.1 | 79.5 | VSE++: Improving Visual-Semantic Embeddings with Hard Negatives |
| Dual-Path (ResNet) | - | - | 89.5 | 39.1 | 69.2 | 80.9 | Dual-Path Convolutional Image-Text Embeddings with Instance Loss |
| VSE-Gradient | 97.0 | 99.6 | 100 | 86.3 | 97.4 | 99.0 | Dissecting Deep Metric Learning Losses for Image-Text Retrieval |
| SGRAF | 77.8 | 94.1 | 97.4 | 58.5 | 83.0 | 88.8 | Similarity Reasoning and Filtration for Image-Text Matching |
| NAPReg | 79.6 | - | - | 60.0 | - | - | NAPReg: Nouns As Proxies Regularization for Semantically Aware Cross-Modal Embeddings |
| BEiT-3 | 98.0 | 100.0 | 100.0 | 90.3 | 98.7 | 99.5 | Image as a Foreign Language: BEiT Pretraining for All Vision and Vision-Language Tasks |
| SCAN | 67.4 | 90.3 | 95.8 | 48.6 | 77.7 | 85.2 | Stacked Cross Attention for Image-Text Matching |
| DSMD | 82.5 | 95.5 | 97.7 | 68.4 | 90.8 | 94.4 | Dynamic Self-adaptive Multiscale Distillation from Pre-trained Multimodal Large Model for Efficient Cross-modal Representation Learning |
| IAIS | 88.3 | 98.4 | 99.4 | 76.86 | 93.3 | 95.72 | Learning Relation Alignment for Calibrated Cross-modal Retrieval |
| X-VLM (base) | 97.1 | 100.0 | 100.0 | 86.9 | 97.3 | 98.7 | Multi-Grained Vision Language Pre-Training: Aligning Texts with Visual Concepts |
| 3SHNet | 87.1 | 98.2 | 99.2 | 69.5 | 91.0 | 94.7 | 3SHNet: Boosting Image-Sentence Retrieval via Visual Semantic-Spatial Self-Highlighting |
| CMPL (ResNet) | 49.6 | 76.8 | 86.1 | 37.3 | 65.7 | 75.5 | Deep Cross-Modal Projection Learning for Image-Text Matching |
| ERNIE-ViL 2.0 | 97.2 | 100.0 | 100.0 | 93.3 | 99.4 | 99.8 | ERNIE-ViL 2.0: Multi-view Contrastive Learning for Image-Text Pre-training |
| GSMN | 76.4 | 94.3 | 97.3 | 57.4 | 82.3 | 89.0 | Graph Structured Network for Image-Text Matching |
| SCO (ResNet) | 55.5 | 82.0 | 89.3 | 41.1 | 70.5 | 80.1 | Learning Semantic Concepts and Order for Image and Sentence Matching |
| Aurora (ours, r=128) | 97.2 | 100 | 100 | 86.8 | 97.6 | 98.9 | - |
| ViSTA | 89.5 | 98.4 | 99.6 | 75.8 | 94.2 | 96.9 | ViSTA: Vision and Scene Text Aggregation for Cross-Modal Retrieval |
| X2-VLM (base) | 98.5 | 100 | 100 | 90.4 | 98.2 | 99.3 | X$^2$-VLM: All-In-One Pre-trained Model For Vision-Language Tasks |
| IMRAM | 74.1 | 93.0 | 96.6 | 53.9 | 79.4 | 87.2 | IMRAM: Iterative Matching with Recurrent Attention Memory for Cross-Modal Image-Text Retrieval |
| ViLT-B/32 | 83.5 | 96.7 | 98.6 | 64.4 | 88.7 | 93.8 | ViLT: Vision-and-Language Transformer Without Convolution or Region Supervision |