Composed Image Retrieval with Text Feedback via Multi-grained Uncertainty Regularization

We investigate composed image retrieval with text feedback. Users gradually look for the target of interest by moving from coarse to fine-grained feedback. However, existing methods merely focus on the latter, i.e., fine-grained search, by harnessing positive and negative pairs during training. This pair-based paradigm only considers the one-to-one distance between a pair of specific points, which is not aligned with the one-to-many coarse-grained retrieval process and compromises the recall rate. In an attempt to fill this gap, we introduce a unified learning approach to simultaneously modeling the coarse- and fine-grained retrieval by considering the multi-grained uncertainty. The key idea underpinning the proposed method is to integrate fine- and coarse-grained retrieval as matching data points with small and large fluctuations, respectively. Specifically, our method contains two modules: uncertainty modeling and uncertainty regularization. (1) The uncertainty modeling simulates the multi-grained queries by introducing identically distributed fluctuations in the feature space. (2) Based on the uncertainty modeling, we further introduce uncertainty regularization to adapt the matching objective according to the fluctuation range. Compared with existing methods, the proposed strategy explicitly prevents the model from pushing away potential candidates in the early stage, and thus improves the recall rate. On the three public datasets, i.e., FashionIQ, Fashion200k, and Shoes, the proposed method has achieved +4.03%, +3.38%, and +2.40% Recall@50 accuracy over a strong baseline, respectively.
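The two modules can be illustrated with a minimal, self-contained sketch. This is not the paper's implementation: the function names, the Gaussian form of the fluctuations, and the rule that shrinks the matching margin as the fluctuation range grows are all illustrative assumptions chosen to mirror the abstract's description (jitter features to simulate coarser queries; relax the objective so coarse matches are not pushed away as hard as fine-grained ones).

```python
import math
import random

def perturb(features, sigma, rng=random.Random(0)):
    """Uncertainty modeling (illustrative): add identically distributed
    Gaussian fluctuations to each feature dimension. A larger sigma
    simulates a coarser, more ambiguous query."""
    return [f + rng.gauss(0.0, sigma) for f in features]

def adaptive_triplet_loss(query, positive, negative, sigma, base_margin=0.4):
    """Uncertainty regularization (illustrative): adapt the matching
    objective to the fluctuation range. Here the margin is simply scaled
    down as sigma grows (an assumed rule), so that under large uncertainty
    the model does not push potential candidates away as aggressively."""
    margin = base_margin / (1.0 + sigma)
    d_pos = math.dist(query, positive)
    d_neg = math.dist(query, negative)
    return max(0.0, d_pos - d_neg + margin)
```

For example, a hard triplet that incurs a loss of 0.3 under a fine-grained query (`sigma=0`) incurs only 0.1 when the same features are treated as a coarse query with `sigma=1.0`, since the margin is halved.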