Learning Transferable Visual Models From Natural Language Supervision

Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, Ilya Sutskever
Abstract

State-of-the-art computer vision systems are trained to predict a fixed set of predetermined object categories. This restricted form of supervision limits their generality and usability since additional labeled data is needed to specify any other visual concept. Learning directly from raw text about images is a promising alternative which leverages a much broader source of supervision. We demonstrate that the simple pre-training task of predicting which caption goes with which image is an efficient and scalable way to learn SOTA image representations from scratch on a dataset of 400 million (image, text) pairs collected from the internet. After pre-training, natural language is used to reference learned visual concepts (or describe new ones) enabling zero-shot transfer of the model to downstream tasks. We study the performance of this approach by benchmarking on over 30 different existing computer vision datasets, spanning tasks such as OCR, action recognition in videos, geo-localization, and many types of fine-grained object classification. The model transfers non-trivially to most tasks and is often competitive with a fully supervised baseline without the need for any dataset specific training. For instance, we match the accuracy of the original ResNet-50 on ImageNet zero-shot without needing to use any of the 1.28 million training examples it was trained on. We release our code and pre-trained model weights at https://github.com/OpenAI/CLIP.
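As a concrete illustration of the zero-shot transfer described in the abstract, the sketch below follows the usage pattern of the released repository (https://github.com/OpenAI/CLIP): an image and a set of candidate captions are embedded into a shared space, and the caption most similar to the image is taken as the prediction. The model name, image path, and candidate labels here are placeholders chosen for illustration, and the `clip` package is assumed to be installed from that repository.

```python
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"

# Load a pre-trained CLIP model along with its matching image preprocessing pipeline.
model, preprocess = clip.load("ViT-B/32", device=device)

# Placeholder image and candidate captions; any natural-language descriptions work.
image = preprocess(Image.open("example.jpg")).unsqueeze(0).to(device)
text = clip.tokenize(["a photo of a dog", "a photo of a cat", "a diagram"]).to(device)

with torch.no_grad():
    # The forward pass returns similarity scores between the image and each caption,
    # scaled by the model's learned temperature.
    logits_per_image, logits_per_text = model(image, text)
    probs = logits_per_image.softmax(dim=-1).cpu().numpy()

print("Probabilities over candidate captions:", probs)
```

Because the candidate captions are ordinary text, swapping in a new set of class descriptions re-targets the classifier to a different task with no additional training, which is the sense in which natural language "references learned visual concepts" above.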