
CoVR-2: Automatic Data Construction for Composed Video Retrieval

Ventura, Lucas ; Yang, Antoine ; Schmid, Cordelia ; Varol, Gül
Abstract

Composed Image Retrieval (CoIR) has recently gained popularity as a task that considers both text and image queries together, to search for relevant images in a database. Most CoIR approaches require manually annotated datasets, comprising image-text-image triplets, where the text describes a modification from the query image to the target image. However, manual curation of CoIR triplets is expensive and prevents scalability. In this work, we instead propose a scalable automatic dataset creation methodology that generates triplets given video-caption pairs, while also expanding the scope of the task to include composed video retrieval (CoVR). To this end, we mine paired videos with a similar caption from a large database, and leverage a large language model to generate the corresponding modification text. Applying this methodology to the extensive WebVid2M collection, we automatically construct our WebVid-CoVR dataset, resulting in 1.6 million triplets. Moreover, we introduce a new benchmark for CoVR with a manually annotated evaluation set, along with baseline results. We further validate that our methodology is equally applicable to image-caption pairs, by generating 3.3 million CoIR training triplets using the Conceptual Captions dataset. Our model builds on BLIP-2 pretraining, adapting it to composed video (or image) retrieval, and incorporates an additional caption retrieval loss to exploit extra supervision beyond the triplet. We provide extensive ablations to analyze the design choices on our new CoVR benchmark. Our experiments also demonstrate that training a CoVR model on our datasets effectively transfers to CoIR, leading to improved state-of-the-art performance in the zero-shot setup on the CIRR, FashionIQ, and CIRCO benchmarks. Our code, datasets, and models are publicly available at https://imagine.enpc.fr/~ventural/covr/.
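
The triplet construction described in the abstract (mine videos with similar captions, then generate a modification text) can be illustrated with a minimal sketch. It rests on assumptions that are not stated in the abstract: two captions are treated as "similar" when they differ in exactly one word, and the language-model step is replaced by a hypothetical template function, `placeholder_modification_text`. The function names and the toy captions are illustrative only and are not taken from the authors' code.

```python
from collections import defaultdict
from itertools import combinations


def mine_caption_pairs(captions):
    """Find pairs of video ids whose captions differ in exactly one word.

    captions: dict mapping a video id to its caption string.
    Returns a list of (id_a, id_b, word_a, word_b) tuples.
    The one-word-difference rule is an assumed similarity criterion.
    """
    buckets = defaultdict(list)
    for vid, caption in captions.items():
        words = caption.lower().split()
        for i in range(len(words)):
            # Mask one position so captions that differ only there share a key.
            key = (i, tuple(words[:i] + ["<mask>"] + words[i + 1:]))
            buckets[key].append((vid, words[i]))

    pairs = []
    for members in buckets.values():
        for (vid_a, word_a), (vid_b, word_b) in combinations(members, 2):
            if word_a != word_b:  # skip identical captions
                pairs.append((vid_a, vid_b, word_a, word_b))
    return pairs


def placeholder_modification_text(word_a, word_b):
    # Hypothetical stand-in for the language-model generation step.
    return f"replace {word_a} with {word_b}"


# Toy captions, invented for illustration.
captions = {
    "v1": "a dog running on the beach",
    "v2": "a cat running on the beach",
    "v3": "a plane taking off",
}
pairs = mine_caption_pairs(captions)
# Each triplet is (query video, modification text, target video).
triplets = [
    (a, placeholder_modification_text(wa, wb), b) for a, b, wa, wb in pairs
]
print(triplets)
```

Bucketing captions by a masked key avoids comparing every caption against every other one, which matters at the scale of millions of captions; a real pipeline would likely add filtering of the mined pairs as well.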
