Cross-Modal Implicit Relation Reasoning and Aligning for Text-to-Image Person Retrieval

Text-to-image person retrieval aims to identify the target person based on a given textual description query. The primary challenge is to learn the mapping of visual and textual modalities into a common latent space. Prior works have attempted to address this challenge by leveraging separately pre-trained unimodal models to extract visual and textual features. However, these approaches lack the underlying alignment capabilities required to match multimodal data effectively. Moreover, these works use prior information to explore explicit part alignments, which may distort intra-modality information. To alleviate these issues, we present IRRA: a cross-modal Implicit Relation Reasoning and Aligning framework that learns relations between local visual-textual tokens and enhances global image-text matching without requiring additional prior supervision. Specifically, we first design an Implicit Relation Reasoning module in a masked language modeling paradigm, which achieves cross-modal interaction by integrating visual cues into the textual tokens through a multimodal interaction encoder. Secondly, to globally align the visual and textual embeddings, Similarity Distribution Matching is proposed to minimize the KL divergence between image-text similarity distributions and the normalized label matching distributions. The proposed method achieves new state-of-the-art results on all three public datasets, with a notable margin of about 3%-9% in Rank-1 accuracy compared to prior methods.
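
As a rough illustration of the Similarity Distribution Matching objective described above, the sketch below computes, for a batch of paired image and text embeddings with identity labels, the KL divergence between the softmax-normalized image-text similarity distribution and the normalized label matching distribution. The function name, temperature value, epsilon, and the bidirectional sum are illustrative assumptions, not the authors' reference implementation.

```python
import torch
import torch.nn.functional as F

def sdm_loss(image_feats, text_feats, pids, temperature=0.02, eps=1e-8):
    """Minimal Similarity Distribution Matching sketch (assumed details):
    KL divergence between image-text similarity distributions and the
    normalized identity-label matching distributions."""
    # Cosine similarities between all image and text embeddings in the batch
    image_feats = F.normalize(image_feats, dim=-1)
    text_feats = F.normalize(text_feats, dim=-1)
    sim_i2t = image_feats @ text_feats.t() / temperature          # (B, B)

    # Binary label-match matrix: 1 where image i and text j share an identity
    labels = (pids.unsqueeze(1) == pids.unsqueeze(0)).float()     # (B, B)
    q = labels / labels.sum(dim=1, keepdim=True)                  # normalized targets

    # KL(p || q) in both retrieval directions (image-to-text and text-to-image)
    p_i2t = F.softmax(sim_i2t, dim=1)
    p_t2i = F.softmax(sim_i2t.t(), dim=1)
    loss_i2t = (p_i2t * (torch.log(p_i2t + eps) - torch.log(q + eps))).sum(1).mean()
    loss_t2i = (p_t2i * (torch.log(p_t2i + eps) - torch.log(q + eps))).sum(1).mean()
    return loss_i2t + loss_t2i
```

In this sketch the target distribution spreads probability mass uniformly over all texts (or images) in the batch that share the query's identity, so the objective pulls all same-identity pairs together rather than only the annotated pair.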