Zero-Shot Composed Image Retrieval with Textual Inversion

Composed Image Retrieval (CIR) aims to retrieve a target image based on a query composed of a reference image and a relative caption that describes the difference between the two images. The high effort and cost required for labeling datasets for CIR hamper the widespread usage of existing methods, as they rely on supervised learning. In this work, we propose a new task, Zero-Shot CIR (ZS-CIR), that aims to address CIR without requiring a labeled training dataset. Our approach, named zero-Shot composEd imAge Retrieval with textuaL invErsion (SEARLE), maps the visual features of the reference image into a pseudo-word token in CLIP token embedding space and integrates it with the relative caption. To support research on ZS-CIR, we introduce an open-domain benchmarking dataset named Composed Image Retrieval on Common Objects in context (CIRCO), which is the first CIR dataset containing multiple ground truths for each query. The experiments show that SEARLE outperforms the baselines on the two main CIR datasets, FashionIQ and CIRR, and on the proposed CIRCO. The dataset, the code, and the model are publicly available at https://github.com/miccunifi/SEARLE.
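The core idea — mapping the reference image's visual features to a single pseudo-word token that can be composed with the relative caption — can be sketched as follows. This is a minimal illustration with NumPy, not the authors' implementation: the dimensions, the linear mapping `W` (standing in for a trained textual-inversion network), and the function names are all hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions (not from the paper): a CLIP-like visual
# feature size and token-embedding size.
VISUAL_DIM = 640
TOKEN_DIM = 512

# Stand-in for a trained textual-inversion mapping: in SEARLE-style
# methods this would be a learned network, not a random matrix.
W = rng.standard_normal((TOKEN_DIM, VISUAL_DIM)) * 0.02

def to_pseudo_token(visual_features: np.ndarray) -> np.ndarray:
    """Map the reference image's visual features to one pseudo-word
    token embedding v* living in the text token embedding space."""
    return W @ visual_features

def compose_query(pseudo_token: np.ndarray,
                  caption_token_embeddings: np.ndarray) -> np.ndarray:
    """Prepend v* to the relative caption's token embeddings, mimicking
    a prompt such as "a photo of <v*> that <relative caption>"."""
    return np.vstack([pseudo_token[None, :], caption_token_embeddings])

visual = rng.standard_normal(VISUAL_DIM)        # reference-image features
caption = rng.standard_normal((7, TOKEN_DIM))   # 7 caption token embeddings
query = compose_query(to_pseudo_token(visual), caption)
print(query.shape)  # one pseudo-word token plus the caption tokens
```

The composed sequence would then be fed through the text encoder, and retrieval reduces to standard text-to-image similarity search in the shared embedding space, which is what lets the method work without CIR-specific labels.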