COYO-700M Image-text Pair Dataset
Date: a year ago
Size: 104.46 GB

COYO-700M is a large-scale dataset containing 747 million image-text pairs, along with many other meta-attributes that make it easier to use for training various models. It follows a strategy similar to that of previous vision-and-language datasets: collecting informative alternative texts from HTML documents together with their associated images.
Data Collection Process
From October 2020 to August 2021, the research team collected approximately 10 billion pairs of alternative text and image source from HTML documents in CommonCrawl, then removed uninformative pairs at minimal cost by filtering at both the image level and the text level. The figure outlines the research team's data collection process.
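As a rough illustration of the collection step described above, the following minimal sketch pulls (image source, alternative text) pairs out of an HTML document and applies a simple text-level filter. It is not the research team's pipeline: the names `AltTextPairParser`, `extract_pairs`, `is_informative`, and `collect_pairs` are hypothetical, and the `min_words` threshold is an assumed stand-in for the real filtering rules, which are not given here. Image-level filtering is omitted.

```python
from html.parser import HTMLParser


class AltTextPairParser(HTMLParser):
    """Collects (image source, alt text) pairs from <img> tags."""

    def __init__(self):
        super().__init__()
        self.pairs = []

    def handle_starttag(self, tag, attrs):
        # Also invoked for self-closing tags such as <img ... />.
        if tag != "img":
            return
        attr_map = dict(attrs)
        src = attr_map.get("src")
        alt = attr_map.get("alt")
        # Keep only images that carry both a source and a non-empty alt text.
        if src and alt:
            self.pairs.append((src, alt.strip()))


def extract_pairs(html):
    """Return every (src, alt) pair found in one HTML document."""
    parser = AltTextPairParser()
    parser.feed(html)
    parser.close()
    return parser.pairs


def is_informative(alt_text, min_words=3):
    # Hypothetical text-level filter: the threshold is an assumption for
    # illustration, not the rule used to build COYO-700M.
    return len(alt_text.split()) >= min_words


def collect_pairs(html, min_words=3):
    """Extract pairs and drop those whose alt text looks uninformative."""
    return [
        (src, alt)
        for src, alt in extract_pairs(html)
        if is_informative(alt, min_words)
    ]
```

Calling `collect_pairs(page_html)` on a page's markup returns a list of tuples that could then be passed to an image-level filter. Using the standard-library parser keeps the sketch dependency-free; a pipeline at CommonCrawl scale would likely need a faster parser and deduplication.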
coyo-700m.torrent
Seeding: 2 | Downloading: 1 | Completed: 87 | Total Downloads: 157