Abstract

Composed Image Retrieval (CIR) is a complex task that aims to retrieve images based on a multimodal query. Typical training data consists of triplets containing a reference image, a textual description of desired modifications, and the target image, which are expensive and time-consuming to acquire. The scarcity of CIR datasets has led to zero-shot approaches utilizing synthetic triplets or leveraging vision-language models (VLMs) with ubiquitous web-crawled image-caption pairs. However, these methods have significant limitations: synthetic triplets suffer from limited scale, lack of diversity, and unnatural modification text, while image-caption pairs hinder joint embedding learning of the multimodal query due to the absence of triplet data. Moreover, existing approaches struggle with complex and nuanced modification texts that demand sophisticated fusion and understanding of vision and language modalities. We present CoLLM, a one-stop framework that effectively addresses these limitations. Our approach generates triplets on-the-fly from image-caption pairs, enabling supervised training without manual annotation. We leverage Large Language Models (LLMs) to generate joint embeddings of reference images and modification texts, facilitating deeper multimodal fusion. Additionally, we introduce Multi-Text CIR (MTCIR), a large-scale dataset comprising 3.4M samples, and refine existing CIR benchmarks (CIRR and Fashion-IQ) to enhance evaluation reliability. Experimental results demonstrate that CoLLM achieves state-of-the-art performance across multiple CIR benchmarks and settings. MTCIR yields competitive results, with up to 15% performance improvement. Our refined benchmarks provide more reliable evaluation metrics for CIR models, contributing to the advancement of this important field.
