HyperAI

PD12M Large-Scale Image-Text Dataset

Public Domain 12M (PD12M) is a large-scale image-text dataset released by Spawning in 2024. It contains 12.4 million high-quality public domain and CC0-licensed images paired with synthetic captions, and it is intended mainly for training text-to-image models. At the time of its release, PD12M was described as the largest public domain image-text dataset. Its scale and clear licensing give AI model training a solid foundation while minimizing copyright concerns. The accompanying paper is "Public Domain 12M: A Highly Aesthetic Image-Text Dataset with Novel Governance Mechanisms".

PD12M draws its images from galleries, libraries, archives, and museums (GLAM) as well as Wikimedia Commons. Careful screening and governance are used to ensure the quality and safety of the data. The construction process runs through several steps: image collection, copyright verification, image download, content filtering, and caption generation. PD12M also introduces a community-driven data governance mechanism through the Source.Plus platform, which supports the ongoing improvement and maintenance of the dataset.
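The construction steps listed above can be pictured as a chain of filters. The following is a minimal sketch and not Spawning's actual pipeline: the `Candidate` record, the `ALLOWED_LICENSES` set, and the `content_filter` and `captioner` callbacks are all hypothetical names introduced for illustration, and the image download step is omitted.

```python
from dataclasses import dataclass

# Hypothetical license labels; the source only says "public domain and CC0".
ALLOWED_LICENSES = {"public domain", "cc0"}


@dataclass
class Candidate:
    """Hypothetical record for one collected image."""
    url: str
    license: str
    caption: str = ""


def passes_license_check(candidate):
    """Copyright verification step: keep only allowed license labels."""
    return candidate.license.strip().lower() in ALLOWED_LICENSES


def run_pipeline(candidates, content_filter, captioner):
    """Apply license check, content filter, then caption generation, in order."""
    kept = []
    for candidate in candidates:
        if not passes_license_check(candidate):
            continue  # rejected at copyright verification
        if not content_filter(candidate):
            continue  # rejected at content filtering
        kept.append(
            Candidate(candidate.url, candidate.license, captioner(candidate))
        )
    return kept
```

In practice each callback would wrap a real component, such as a safety classifier for `content_filter` and a captioning model for `captioner`; both are left abstract here because the page does not describe them.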

PD12M is used mainly to train and evaluate text-to-image generation models, with the aim of advancing work in computer vision and natural language processing. Beyond supplying a rich training resource, the dataset serves as a model for responsible AI practice and encourages the protection and use of public resources for AI.

PD12M.torrent
  • PD12M/
    • README.md
      2.02 KB
    • README.txt
      4.05 KB
    • data/
      • PD12M.zip
        34.77 GB
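
Once the torrent has finished, the archive can be inspected without extracting all 34.77 GB. The sketch below uses only Python's standard `zipfile` module. The path `PD12M/data/PD12M.zip` is assumed from the listing above, and the helper name `summarize_archive` is hypothetical. The page does not state what file types the archive holds, so the code only counts whatever suffixes it finds.

```python
import zipfile
from pathlib import Path


def summarize_archive(zip_path):
    """Return (file_count, total_uncompressed_bytes, suffix_counts) for a zip."""
    suffix_counts = {}
    total_bytes = 0
    file_count = 0
    with zipfile.ZipFile(zip_path) as archive:
        for info in archive.infolist():
            if info.is_dir():
                continue  # skip directory entries
            file_count += 1
            total_bytes += info.file_size  # uncompressed size in bytes
            suffix = Path(info.filename).suffix.lower() or "(none)"
            suffix_counts[suffix] = suffix_counts.get(suffix, 0) + 1
    return file_count, total_bytes, suffix_counts


if __name__ == "__main__":
    # Assumed location, taken from the torrent file listing; adjust as needed.
    archive_path = Path("PD12M/data/PD12M.zip")
    if archive_path.exists():
        count, size, suffixes = summarize_archive(archive_path)
        print(f"{count} files, {size / 1e9:.2f} GB uncompressed")
        for suffix, n in sorted(suffixes.items()):
            print(f"  {suffix}: {n}")
    else:
        print(f"{archive_path} not found")
```

Reading only the zip's central directory keeps this fast even for a large archive, which makes it a reasonable first check before committing disk space to a full extraction.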