Abstract

In this paper, we introduce MALS, a large Multi-Attribute and Language Search dataset for text-based person retrieval, and explore the feasibility of pre-training jointly on attribute recognition and image-text matching. MALS contains 1,510,330 image-text pairs, making it about 37.5 times larger than the prevailing CUHK-PEDES, and all images are annotated with 27 attributes. In view of privacy concerns and annotation costs, we leverage off-the-shelf diffusion models to generate the dataset. To verify the feasibility of learning from the generated data, we develop a new joint Attribute Prompt Learning and Text Matching Learning (APTM) framework that exploits the knowledge shared between attributes and text. As the name implies, APTM contains an attribute prompt learning stream and a text matching learning stream. (1) Attribute prompt learning leverages attribute prompts for image-attribute alignment, which enhances text matching learning. (2) Text matching learning facilitates representation learning on fine-grained details and, in turn, boosts attribute prompt learning. Extensive experiments validate the effectiveness of pre-training on MALS: APTM achieves state-of-the-art retrieval performance on three challenging real-world benchmarks. In particular, APTM improves Recall@1 accuracy by +6.96%, +7.68%, and +16.95% on the CUHK-PEDES, ICFG-PEDES, and RSTPReid datasets, respectively.
