HyperAI

COYO-700M Image-text Pair Dataset

Date

a year ago

Size

104.46 GB

Organization

Publish URL

github.com


COYO-700M is a large-scale dataset containing 747 million image-text pairs, along with many other meta-attributes that make it easier to use for training various models. It follows a strategy similar to that of earlier vision-and-language datasets: it collects informative alternative text (alt-text) from HTML documents together with the images associated with it.

Data Collection Process

The research team collected approximately 10 billion pairs of alternative text and image sources from HTML documents in CommonCrawl, covering October 2020 to August 2021, and then eliminated uninformative pairs at minimal cost through filtering at both the image level and the text level. The figure outlines the research team's data collection process.
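As a rough illustration of the collection step described above, the following minimal sketch pulls (image source, alt-text) pairs out of an HTML document and applies a simple text-level filter. It is not the research team's pipeline: the class `AltTextPairParser`, the helper `is_informative`, the `min_words` threshold, and the sample HTML are all hypothetical, and the real filtering criteria are not stated on this page. Image-level filtering is omitted entirely.

```python
from html.parser import HTMLParser


class AltTextPairParser(HTMLParser):
    """Collect (image source, alt text) pairs from <img> tags."""

    def __init__(self):
        super().__init__()
        self.pairs = []

    def handle_starttag(self, tag, attrs):
        if tag != "img":
            return
        attr_map = dict(attrs)
        src = attr_map.get("src")
        alt = attr_map.get("alt")
        # Skip images that lack a source or have missing/empty alt text.
        if src and alt:
            self.pairs.append((src, alt.strip()))


def is_informative(alt, min_words=3):
    """Hypothetical text-level filter (an assumption, not the dataset's rule):
    keep alt text with at least `min_words` words, one of them alphabetic."""
    words = alt.split()
    return len(words) >= min_words and any(w.isalpha() for w in words)


def extract_pairs(html):
    """Return the (src, alt) pairs from `html` that pass the text filter."""
    parser = AltTextPairParser()
    parser.feed(html)
    parser.close()
    return [(src, alt) for src, alt in parser.pairs if is_informative(alt)]


# Made-up sample document, used only to exercise the sketch.
SAMPLE_HTML = """
<img src="https://example.com/cat.jpg" alt="A tabby cat sleeping on a sofa">
<img src="https://example.com/logo.png" alt="logo">
<img src="https://example.com/spacer.gif" alt="">
<img src="https://example.com/no-alt.jpg">
"""

pairs = extract_pairs(SAMPLE_HTML)
for src, alt in pairs:
    print(src, "|", alt)
```

Filtering on text first is cheap because it needs no image download, which fits the goal of removing uninformative pairs at minimal cost; an image-level check (for example on size or format) would run only on the pairs that survive.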

coyo-700m.torrent
Seeding: 2, Downloading: 1, Completed: 87, Total Downloads: 157
  • coyo-700m/
    • README.md
      1.32 KB
    • README.txt
      2.63 KB
    • data/
      • coyo-700m.zip
        104.46 GB