Conceptual Captions Dataset (CC3M)

The dataset was released by Google in 2018 and contains roughly 3.3 million image-caption pairs. The team built an automatic pipeline to extract, filter, and transform candidate image/caption pairs from billions of web pages.
The dataset is divided into training, validation, and test sets. The training set consists of 3,318,333 image URL/caption pairs; the captions draw on a vocabulary of 51,201 unique token types and average 10.3 tokens each. The validation set consists of 15,840 image URL/caption pairs.
In addition, the team provides machine-generated image labels for 2,007,528 of the training-set image URL/caption pairs.
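As a sketch of how such URL/caption pairs are typically consumed (assuming the dataset's tab-separated text layout with one caption and one image URL per line; the sample rows and helper name below are illustrative, not from the release):

```python
import csv
import io

# Hypothetical sample rows in the <caption>\t<image URL> layout.
sample_tsv = (
    "a photo of a dog playing in the park\thttp://example.com/1.jpg\n"
    "sunset over the mountains\thttp://example.com/2.jpg\n"
)

def load_pairs(tsv_file):
    """Yield (caption, url) tuples from a tab-separated pairs file."""
    reader = csv.reader(tsv_file, delimiter="\t")
    for caption, url in reader:
        yield caption, url

pairs = list(load_pairs(io.StringIO(sample_tsv)))
# Average whitespace-token count per caption (the paper reports 10.3
# for the real training split).
avg_tokens = sum(len(c.split()) for c, _ in pairs) / len(pairs)
print(len(pairs), avg_tokens)
```

Since only URLs are distributed rather than the images themselves, a real loader would additionally fetch each URL and skip entries that have gone stale.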
Related paper:
Conceptual Captions: A Cleaned, Hypernymed, Image Alt-text Dataset For Automatic Image Captioning