Semantically Self-Aligned Network for Text-to-Image Part-aware Person Re-identification

Text-to-image person re-identification (ReID) aims to search for images containing a person of interest using textual descriptions. However, due to the significant modality gap and the large intra-class variance in textual descriptions, text-to-image ReID remains a challenging problem. Accordingly, in this paper, we propose a Semantically Self-Aligned Network (SSAN) to handle the above problems. First, we propose a novel method that automatically extracts semantically aligned part-level features from the two modalities. Second, we design a multi-view non-local network that captures the relationships between body parts, thereby establishing better correspondences between body parts and noun phrases. Third, we introduce a Compound Ranking (CR) loss that makes use of textual descriptions for other images of the same identity to provide extra supervision, thereby effectively reducing the intra-class variance in textual features. Finally, to expedite future research in text-to-image ReID, we build a new database named ICFG-PEDES. Extensive experiments demonstrate that SSAN outperforms state-of-the-art approaches by significant margins. Both the new ICFG-PEDES database and the SSAN code are available at https://github.com/zifyloo/SSAN.
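To make the Compound Ranking idea concrete, the following is a minimal sketch of how a ranking loss could incorporate descriptions of other images of the same identity as weak positives, in addition to an image's own description. The formulation, the margins alpha and beta, and the function name compound_ranking_loss are assumptions for illustration only; the abstract does not specify the exact loss, so this should not be read as the authors' implementation.

```python
import torch
import torch.nn.functional as F


def compound_ranking_loss(img_feat, own_txt, same_id_txt, other_txt,
                          alpha=0.2, beta=0.1):
    """Hypothetical compound ranking loss sketch.

    img_feat:     (D,)   feature of one image
    own_txt:      (D,)   feature of that image's own description (strong positive)
    same_id_txt:  (K, D) descriptions of other images of the same identity (weak positives)
    other_txt:    (M, D) descriptions of other identities (negatives)
    alpha, beta:  assumed margins for the strong and weak terms
    """
    # Cosine similarity via L2-normalized dot products.
    img_feat = F.normalize(img_feat, dim=0)
    own_txt = F.normalize(own_txt, dim=0)
    same_id_txt = F.normalize(same_id_txt, dim=1)
    other_txt = F.normalize(other_txt, dim=1)

    pos = img_feat @ own_txt          # scalar: similarity to own description
    weak_pos = same_id_txt @ img_feat # (K,): same-identity descriptions
    neg = other_txt @ img_feat        # (M,): other-identity descriptions

    # Strong term: the image's own description must beat every negative by alpha.
    strong = F.relu(alpha - pos + neg).mean()
    # Weak term: same-identity descriptions act as extra supervision with a
    # smaller margin, pulling all captions of one identity toward the image
    # and thereby reducing intra-class variance in the textual features.
    weak = F.relu(beta - weak_pos.unsqueeze(1) + neg.unsqueeze(0)).mean()
    return strong + weak
```

The design choice illustrated here is that weak positives receive a smaller margin than the image's own caption, so they provide extra supervision without forcing all same-identity descriptions to match the image as tightly as its own description does.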