
MegaPairs: Massive Data Synthesis For Universal Multimodal Retrieval

Junjie Zhou, Zheng Liu, Ze Liu, Shitao Xiao, Yueze Wang, Bo Zhao, Chen Jason Zhang, Defu Lian, Yongping Xiong
Abstract

Despite the rapidly growing demand for multimodal retrieval, progress in this field remains severely constrained by a lack of training data. In this paper, we introduce MegaPairs, a novel data synthesis method that leverages vision-language models (VLMs) and open-domain images, together with a massive synthetic dataset generated by this method. Our empirical analysis shows that MegaPairs generates high-quality data, enabling the multimodal retriever to significantly outperform the baseline model trained on 70× more data from existing datasets. Moreover, since MegaPairs relies solely on general image corpora and open-source VLMs, it can be easily scaled up, enabling continuous improvements in retrieval performance. At this stage, we have produced more than 26 million training instances and trained several models of varying sizes using this data. These new models achieve state-of-the-art zero-shot performance across 4 popular composed image retrieval (CIR) benchmarks and the highest overall performance on the 36 datasets provided by MMEB. They also demonstrate notable performance improvements with additional downstream fine-tuning. Our produced dataset, well-trained models, and data synthesis pipeline will be made publicly available to facilitate the future development of this field.