Noisy Correspondence Learning with Meta Similarity Correction

Despite the success of multimodal learning in cross-modal retrieval tasks, this remarkable progress relies on correct correspondence among multimedia data. However, collecting such ideal data is expensive and time-consuming. In practice, most widely used datasets are harvested from the Internet and inevitably contain mismatched pairs. Training on such noisy correspondence datasets degrades performance because cross-modal retrieval methods can wrongly force mismatched data to be similar. To tackle this problem, we propose a Meta Similarity Correction Network (MSCN) to provide reliable similarity scores. We view a binary classification task as the meta-process that encourages the MSCN to learn discrimination from positive and negative meta-data. To further alleviate the influence of noise, we design an effective data purification strategy that uses meta-data as prior knowledge to remove noisy samples. Extensive experiments demonstrate the strengths of our method under both synthetic and real-world noise, on datasets including Flickr30K, MS-COCO, and Conceptual Captions.