VGSG: Vision-Guided Semantic-Group Network for Text-based Person Search

Text-based Person Search (TBPS) aims to retrieve images of a target pedestrian indicated by a textual description. It is essential for TBPS to extract fine-grained local features and align them across modalities. Existing methods rely on external tools or heavy cross-modal interaction to achieve explicit alignment of cross-modal fine-grained features, which is inefficient and time-consuming. In this work, we propose a Vision-Guided Semantic-Group Network (VGSG) for text-based person search that extracts well-aligned fine-grained visual and textual features. In the proposed VGSG, we develop a Semantic-Group Textual Learning (SGTL) module and a Vision-guided Knowledge Transfer (VGKT) module to extract textual local features under the guidance of visual local clues. In SGTL, to obtain local textual representations, we group textual features along the channel dimension based on the semantic cues of language expression, which encourages similar semantic patterns to be grouped implicitly without external tools. In VGKT, a vision-guided attention is employed to extract visual-related textual features, which are inherently aligned with visual cues and termed vision-guided textual features. Furthermore, we design a relational knowledge transfer, including a vision-language similarity transfer and a class probability transfer, to adaptively propagate information from the vision-guided textual features to the semantic-group textual features. With the help of relational knowledge transfer, VGKT aligns semantic-group textual features with the corresponding visual features without external tools or complex pairwise interaction. Experimental results on two challenging benchmarks demonstrate the superiority of VGSG over state-of-the-art methods.
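To make the two mechanisms summarized above more concrete, the following is a minimal PyTorch-style sketch of (a) channel-wise semantic grouping of word features, (b) a vision-guided attention that uses visual local features as queries over word features, and (c) a similarity-transfer loss that propagates knowledge from the vision-guided textual branch to the semantic-group branch. All module names, tensor shapes, and pooling/loss choices here are assumptions made for illustration; this is not the authors' implementation.

```python
# Illustrative sketch only (not the paper's code). Shapes, module names, and the
# specific pooling and distillation choices are assumptions for exposition.
import torch
import torch.nn as nn
import torch.nn.functional as F


class SemanticGroupPooling(nn.Module):
    """Split word features into K channel groups and pool each group into one
    local textual feature (an illustrative stand-in for SGTL)."""

    def __init__(self, dim: int, num_groups: int):
        super().__init__()
        assert dim % num_groups == 0
        self.num_groups = num_groups
        # Project each pooled group back to the full feature dimension.
        self.proj = nn.Linear(dim // num_groups, dim)

    def forward(self, word_feats: torch.Tensor) -> torch.Tensor:
        # word_feats: (B, L, D) word-level textual features
        B, L, D = word_feats.shape
        groups = word_feats.view(B, L, self.num_groups, D // self.num_groups)
        # Max-pool over words within each channel group -> K local features
        local_txt = groups.max(dim=1).values          # (B, K, D/K)
        return self.proj(local_txt)                   # (B, K, D)


class VisionGuidedAttention(nn.Module):
    """Use visual local features as queries over word features to obtain
    vision-guided textual features that are aligned with visual cues."""

    def __init__(self, dim: int):
        super().__init__()
        self.q_proj = nn.Linear(dim, dim)
        self.k_proj = nn.Linear(dim, dim)
        self.v_proj = nn.Linear(dim, dim)
        self.scale = dim ** -0.5

    def forward(self, vis_local: torch.Tensor, word_feats: torch.Tensor) -> torch.Tensor:
        # vis_local: (B, K, D) visual local features; word_feats: (B, L, D)
        q = self.q_proj(vis_local)                    # (B, K, D)
        k = self.k_proj(word_feats)                   # (B, L, D)
        v = self.v_proj(word_feats)                   # (B, L, D)
        attn = torch.softmax(q @ k.transpose(1, 2) * self.scale, dim=-1)  # (B, K, L)
        return attn @ v                               # (B, K, D) vision-guided textual features


def similarity_transfer_loss(sg_txt, vg_txt, vis_local):
    """Illustrative vision-language similarity transfer: make the text-to-visual
    similarity structure of the semantic-group branch mimic that of the
    vision-guided branch (the latter treated as the teacher)."""
    sim_sg = F.cosine_similarity(sg_txt, vis_local, dim=-1)   # (B, K)
    sim_vg = F.cosine_similarity(vg_txt, vis_local, dim=-1)   # (B, K)
    return F.kl_div(F.log_softmax(sim_sg, dim=-1),
                    F.softmax(sim_vg.detach(), dim=-1),
                    reduction="batchmean")
```

In this sketch, the vision-guided branch is used only as a training-time teacher; at inference, retrieval could rely on the semantic-group textual features alone, which is consistent with the abstract's goal of avoiding external tools and complex pairwise interaction at test time.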