Data Roaming and Quality Assessment for Composed Image Retrieval

The task of Composed Image Retrieval (CoIR) involves queries that combine image and text modalities, allowing users to express their intent more effectively. However, current CoIR datasets are orders of magnitude smaller than other vision and language (V&L) datasets. Additionally, some of these datasets have noticeable issues, such as queries containing redundant modalities. To address these shortcomings, we introduce the Large Scale Composed Image Retrieval (LaSCo) dataset, a new CoIR dataset that is ten times larger than existing ones. Pre-training on LaSCo yields a noteworthy improvement in performance, even in the zero-shot setting. Furthermore, we propose a new approach for analyzing CoIR datasets and methods, which detects modality redundancy or necessity in queries. We also introduce a new CoIR baseline, the Cross-Attention driven Shift Encoder (CASE). This baseline allows for early fusion of modalities using a cross-attention module and employs an additional auxiliary task during training. Our experiments demonstrate that this new baseline outperforms the current state-of-the-art methods on established benchmarks like FashionIQ and CIRR.