iSEARLE: Improving Textual Inversion for Zero-Shot Composed Image Retrieval

Given a query consisting of a reference image and a relative caption, Composed Image Retrieval (CIR) aims to retrieve target images visually similar to the reference one while incorporating the changes specified in the relative caption. The reliance of supervised methods on labor-intensive manually labeled datasets hinders their broad applicability. In this work, we introduce a new task, Zero-Shot CIR (ZS-CIR), that addresses CIR without the need for a labeled training dataset. We propose an approach named iSEARLE (improved zero-Shot composEd imAge Retrieval with textuaL invErsion) that maps the visual information of the reference image into a pseudo-word token in CLIP token embedding space and combines it with the relative caption. To foster research on ZS-CIR, we present an open-domain benchmarking dataset named CIRCO (Composed Image Retrieval on Common Objects in context), the first CIR dataset where each query is labeled with multiple ground truths and a semantic categorization. The experimental results show that iSEARLE obtains state-of-the-art performance on three different CIR datasets -- FashionIQ, CIRR, and the proposed CIRCO -- and in two additional evaluation settings, namely domain conversion and object composition. The dataset, the code, and the model are publicly available at https://github.com/miccunifi/SEARLE.
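The core idea — mapping an image feature into a pseudo-word token and prepending it to the relative caption's token embeddings — can be sketched as follows. This is a minimal illustration, not the authors' implementation: the dimensions, the single linear map standing in for the textual-inversion network, and the random stand-in features are all assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
VISUAL_DIM, TOKEN_DIM = 768, 512   # stand-ins for CLIP feature/token dimensions (assumption)

# A single linear map stands in for the textual-inversion network that
# projects a CLIP image feature into the token embedding space.
W = rng.standard_normal((VISUAL_DIM, TOKEN_DIM)) / np.sqrt(VISUAL_DIM)

image_feature = rng.standard_normal(VISUAL_DIM)   # stand-in CLIP image feature
pseudo_token = image_feature @ W                  # pseudo-word token, shape (TOKEN_DIM,)

# Stand-in embeddings for the relative caption's 7 tokens.
caption_tokens = rng.standard_normal((7, TOKEN_DIM))

# Compose the query: prepend the pseudo-token to the caption embeddings,
# analogous to filling a template like "a photo of $ that <caption>".
composed_query = np.vstack([pseudo_token[None, :], caption_tokens])
print(composed_query.shape)  # (8, 512)
```

The composed sequence would then be fed through the (frozen) text encoder to produce a query embedding for retrieval.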