Automatic Synthetic Data and Fine-grained Adaptive Feature Alignment for Composed Person Retrieval

Person retrieval has attracted growing attention. Existing methods are mainly divided into two retrieval modes, namely image-only and text-only. However, neither mode makes full use of the available information, and both struggle to meet diverse application requirements. To address these limitations, we propose a new Composed Person Retrieval (CPR) task, which combines visual and textual queries to identify individuals of interest in large-scale person image databases. The foremost difficulty of the CPR task is the lack of annotated datasets. We therefore first introduce a scalable automatic data synthesis pipeline, which decomposes complex multimodal data generation into the creation of textual quadruples followed by identity-consistent image synthesis using fine-tuned generative models. A multimodal filtering method is also designed to ensure that the resulting SynCPR dataset retains 1.15 million high-quality, fully synthetic triplets. Additionally, to improve the representation of composed person queries, we propose a novel Fine-grained Adaptive Feature Alignment (FAFA) framework based on fine-grained dynamic alignment and masked feature reasoning. Moreover, for objective evaluation, we manually annotate the Image-Text Composed Person Retrieval (ITCPR) test set. Extensive experiments demonstrate the effectiveness of the SynCPR dataset and the superiority of the proposed FAFA framework over state-of-the-art methods. All code and data will be provided at https://github.com/Delong-liu-bupt/Composed_Person_Retrieval.
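
To make the CPR setting concrete, the following is a minimal sketch of composed retrieval: a reference-image feature and a text feature are fused into one query, and gallery images are ranked by cosine similarity. It is not the FAFA framework. The additive fusion, the function names (`compose_query`, `rank_gallery`), and the three-dimensional toy features are all assumptions made for illustration only.

```python
import math
from typing import List, Sequence


def l2_normalize(v: Sequence[float]) -> List[float]:
    """Scale a vector to unit length (a zero vector is returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0.0:
        return [0.0 for _ in v]
    return [x / norm for x in v]


def compose_query(image_feat: Sequence[float], text_feat: Sequence[float]) -> List[float]:
    """Hypothetical fusion: sum the normalized features, then renormalize.

    This is a placeholder baseline, not the alignment method of the paper.
    """
    img = l2_normalize(image_feat)
    txt = l2_normalize(text_feat)
    return l2_normalize([a + b for a, b in zip(img, txt)])


def rank_gallery(query: Sequence[float], gallery: Sequence[Sequence[float]]) -> List[int]:
    """Return gallery indices sorted by descending cosine similarity to the query."""
    q = l2_normalize(query)
    scored = []
    for idx, feat in enumerate(gallery):
        g = l2_normalize(feat)
        scored.append((sum(a * b for a, b in zip(q, g)), idx))
    # Ties are broken by the lower gallery index to keep the order deterministic.
    scored.sort(key=lambda s: (-s[0], s[1]))
    return [idx for _, idx in scored]


# Toy features (assumed): axis 0 encodes identity cues from the reference image,
# axis 1 encodes the attribute described by the text.
reference_image_feat = [1.0, 0.0, 0.0]
relative_text_feat = [0.0, 1.0, 0.0]
gallery_feats = [
    [1.0, 0.0, 0.0],  # agrees with the image only
    [0.0, 1.0, 0.0],  # agrees with the text only
    [1.0, 1.0, 0.0],  # agrees with both: the intended target
    [0.0, 0.0, 1.0],  # agrees with neither
]

ranking = rank_gallery(
    compose_query(reference_image_feat, relative_text_feat), gallery_feats
)
print(ranking)
```

In this toy setup the gallery entry that agrees with both the image and the text is ranked first, which is the behaviour a composed query is meant to produce and which an image-only or text-only query cannot guarantee.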