Learning Semantic-Aligned Feature Representation for Text-based Person Search

Text-based person search aims to retrieve images of a certain pedestrian by a textual description. The key challenge of this task is to eliminate the inter-modality gap and achieve feature alignment across modalities. In this paper, we propose a semantic-aligned embedding method for text-based person search, in which the feature alignment across modalities is achieved by automatically learning semantic-aligned visual and textual features. First, we introduce two Transformer-based backbones to encode robust feature representations of the images and texts. Second, we design a semantic-aligned feature aggregation network to adaptively select and aggregate features with the same semantics into part-aware features, which is achieved by a multi-head attention module constrained by a cross-modality part alignment loss and a diversity loss. Experimental results on the CUHK-PEDES and Flickr30K datasets show that our method achieves state-of-the-art performance.
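
The abstract's part-aware aggregation step can be illustrated with a minimal sketch: learnable part queries attend over token-level features from a Transformer backbone, and two auxiliary losses encourage cross-modality alignment of corresponding parts and diversity among parts. This is an assumption-based illustration, not the authors' implementation; all module names, loss instantiations, and hyperparameters below (e.g. `PartAwareAggregator`, the InfoNCE-style alignment loss, `num_parts = 6`) are hypothetical.

```python
# A minimal sketch (not the authors' code) of the part-aware aggregation idea
# described in the abstract: K learnable part queries attend over token-level
# features via multi-head attention, and two auxiliary losses encourage
# (a) alignment between corresponding visual/textual parts and (b) diversity
# among the learned parts. All names and hyperparameters are illustrative.
import torch
import torch.nn as nn
import torch.nn.functional as F


class PartAwareAggregator(nn.Module):
    """Aggregates a sequence of token features into K part-aware features."""

    def __init__(self, dim: int = 768, num_parts: int = 6, num_heads: int = 8):
        super().__init__()
        # K learnable queries, one per semantic part (assumed design).
        self.part_queries = nn.Parameter(torch.randn(num_parts, dim) * 0.02)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (B, N, D) token-level features from a Transformer backbone.
        B = tokens.size(0)
        queries = self.part_queries.unsqueeze(0).expand(B, -1, -1)  # (B, K, D)
        parts, _ = self.attn(queries, tokens, tokens)               # (B, K, D)
        return parts


def part_alignment_loss(img_parts, txt_parts, temperature: float = 0.07):
    """Cross-modality part alignment: the k-th visual part of each image should
    match the k-th textual part of its paired description (an assumed
    InfoNCE-style instantiation of the alignment objective)."""
    B, K, _ = img_parts.shape
    loss = 0.0
    for k in range(K):
        v = F.normalize(img_parts[:, k], dim=-1)   # (B, D)
        t = F.normalize(txt_parts[:, k], dim=-1)   # (B, D)
        logits = v @ t.t() / temperature           # (B, B)
        labels = torch.arange(B, device=logits.device)
        loss = loss + 0.5 * (F.cross_entropy(logits, labels)
                             + F.cross_entropy(logits.t(), labels))
    return loss / K


def diversity_loss(parts):
    """Penalize similarity between different part features of the same sample,
    pushing the K parts toward distinct semantics."""
    p = F.normalize(parts, dim=-1)                 # (B, K, D)
    sim = p @ p.transpose(1, 2)                    # (B, K, K)
    K = parts.size(1)
    off_diag = sim - torch.eye(K, device=sim.device)
    return off_diag.pow(2).mean()


if __name__ == "__main__":
    agg_v, agg_t = PartAwareAggregator(), PartAwareAggregator()
    img_tokens = torch.randn(4, 197, 768)   # e.g. ViT patch tokens
    txt_tokens = torch.randn(4, 64, 768)    # e.g. BERT word tokens
    vp, tp = agg_v(img_tokens), agg_t(txt_tokens)
    total = part_alignment_loss(vp, tp) + 0.1 * diversity_loss(tp)
    print(total.item())
```

In this sketch the same aggregation module is instantiated separately for each modality, so the k-th part query of the visual branch and the k-th part query of the textual branch are tied only through the alignment loss; the weighting between the two losses is likewise an arbitrary placeholder.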