Retrieval-Enhanced Contrastive Vision-Text Models

Contrastive image-text models such as CLIP form the building blocks of many state-of-the-art systems. While they excel at recognizing common generic concepts, they still struggle on fine-grained entities which are rare, or even absent from the pre-training dataset. Hence, a key ingredient to their success has been the use of large-scale curated pre-training data aiming at expanding the set of concepts that they can memorize during the pre-training stage. In this work, we explore an alternative to encoding fine-grained knowledge directly into the model's parameters: we instead train the model to retrieve this knowledge from an external memory. Specifically, we propose to equip existing vision-text models with the ability to refine their embedding with cross-modal retrieved information from a memory at inference time, which greatly improves their zero-shot predictions. Remarkably, we show that this can be done with a light-weight, single-layer, fusion transformer on top of a frozen CLIP. Our experiments validate that our retrieval-enhanced contrastive (RECO) training improves CLIP performance substantially on several challenging fine-grained tasks: for example +10.9 on Stanford Cars, +10.2 on CUB-2011 and +7.3 on the recent OVEN benchmark, where we even outperform the fine-tuned models on unseen classes.
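
The sketch below illustrates the general idea described above: a query embedding from a frozen CLIP encoder retrieves its cross-modal nearest neighbors from an external memory, and a single transformer layer fuses them into a refined embedding. It is a minimal sketch under assumed settings (class name `RetrievalFusion`, embedding size, number of neighbors, and the use of `nn.TransformerEncoderLayer` are all illustrative choices), not the paper's reference implementation.

```python
# Minimal sketch of retrieval-enhanced embedding refinement over frozen
# CLIP features. All module and parameter names here are illustrative
# assumptions, not the authors' implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F


class RetrievalFusion(nn.Module):
    """Refine a query embedding with its k cross-modal neighbors retrieved
    from an external memory, using one transformer encoder layer."""

    def __init__(self, dim: int = 512, num_heads: int = 8, k: int = 10):
        super().__init__()
        self.k = k
        # Light-weight, single-layer fusion module (hypothetical config).
        self.fusion = nn.TransformerEncoderLayer(
            d_model=dim, nhead=num_heads, batch_first=True
        )

    def forward(self, query: torch.Tensor, memory: torch.Tensor) -> torch.Tensor:
        # query:  (B, D) frozen CLIP image embeddings
        # memory: (N, D) frozen CLIP text embeddings of the external memory
        query = F.normalize(query, dim=-1)
        memory = F.normalize(memory, dim=-1)

        # Cross-modal retrieval: cosine similarity, top-k neighbors per query.
        sims = query @ memory.t()                   # (B, N)
        idx = sims.topk(self.k, dim=-1).indices     # (B, k)
        neighbors = memory[idx]                     # (B, k, D)

        # Fuse the query token with its retrieved neighbors; the refined
        # query token is returned as the output embedding.
        tokens = torch.cat([query.unsqueeze(1), neighbors], dim=1)  # (B, k+1, D)
        fused = self.fusion(tokens)
        return F.normalize(fused[:, 0], dim=-1)


if __name__ == "__main__":
    B, N, D = 4, 1000, 512
    model = RetrievalFusion(dim=D)
    image_emb = torch.randn(B, D)    # stand-in for frozen CLIP image features
    text_memory = torch.randn(N, D)  # stand-in for the external text memory
    refined = model(image_emb, text_memory)
    print(refined.shape)             # torch.Size([4, 512])
```

In this reading of the abstract, only the fusion layer would be trained while the CLIP encoders and the memory embeddings stay frozen and can be pre-computed, which is what keeps the approach light-weight at inference time.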