Image Retrieval on Real-life Images with Pre-trained Vision-and-Language Models

We extend the task of composed image retrieval, where an input query consists of an image and a short textual description of how to modify the image. Existing methods have only been applied to non-complex images within narrow domains, such as fashion products, thereby limiting the scope of study on in-depth visual reasoning in rich image and language contexts. To address this issue, we collect the Compose Image Retrieval on Real-life images (CIRR) dataset, which consists of over 36,000 pairs of crowd-sourced, open-domain images with human-generated modifying text. To extend current methods to the open domain, we propose CIRPLANT, a transformer-based model that leverages rich pre-trained vision-and-language (V&L) knowledge for modifying visual features conditioned on natural language. Retrieval is then done by nearest-neighbor lookup on the modified features. We demonstrate that with a relatively simple architecture, CIRPLANT outperforms existing methods on open-domain images, while matching state-of-the-art accuracy on existing narrow datasets, such as fashion. Together with the release of CIRR, we believe this work will inspire further research on composed image retrieval.
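For concreteness, the retrieval pipeline the abstract describes can be sketched as: compose the reference-image feature with the modification text, then rank gallery images by similarity to the composed feature. The sketch below is a minimal illustration under assumed conditions, not CIRPLANT itself; the fusion module, the 512-dimensional features, and the function names `compose_query` and `retrieve` are all hypothetical stand-ins.

```python
import torch
import torch.nn.functional as F

def compose_query(image_feat: torch.Tensor, text_feat: torch.Tensor,
                  fusion: torch.nn.Module) -> torch.Tensor:
    """Modify the reference image feature conditioned on the text feature.

    A real system would use a pre-trained V&L transformer here; a single
    linear fusion layer is used purely for illustration.
    """
    fused = fusion(torch.cat([image_feat, text_feat], dim=-1))
    return F.normalize(fused, dim=-1)

def retrieve(query: torch.Tensor, gallery: torch.Tensor, k: int = 5) -> torch.Tensor:
    """Nearest-neighbor lookup: rank gallery features by cosine similarity."""
    sims = F.normalize(gallery, dim=-1) @ query  # (num_gallery,)
    return sims.topk(k).indices

# Toy usage with random features (dimensions are assumptions).
fusion = torch.nn.Linear(1024, 512)
img_feat, txt_feat = torch.randn(512), torch.randn(512)
gallery = torch.randn(1000, 512)
top_k = retrieve(compose_query(img_feat, txt_feat, fusion), gallery)
```

In practice the gallery features would be pre-computed once, so each query costs only one composition pass plus a nearest-neighbor search.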