Common Corpus-zh Chinese Public Domain Dataset

Date

a year ago

Size

225.16 MB

Organization

Hugging Face

Publish URL

huggingface.co

Common Corpus was jointly created by Pleias, Hugging Face, and other organizations. It is the largest public domain dataset released to date and is designed specifically for training large language models (LLMs). The dataset contains 500 billion words drawn from cultural heritage projects around the world and covers multiple languages, including English, French, Chinese, Spanish, German, and Italian, making it the most comprehensive open language resource to date.

Its English portion, the largest to date at 180 billion words, includes 21 million documents from Chronicling America (a major US digitized-newspaper project), Nomic AI's original corpus map, and monograph data collected by Sebastian Majstorovic. In addition, Common Corpus contains the largest open datasets for French (110 billion words), German (30 billion words), Spanish, Dutch, and Italian, as well as several low-resource languages that rarely appear in large-scale language model training.

The release of this dataset demonstrates that LLMs can be trained without relying on copyright-encumbered sources such as Common Crawl. It aims to provide an open AI data-sharing platform that simplifies the research process, improves reproducibility, promotes the accessibility, diversity, and democratization of AI, and keeps the knowledge behind large models open for dissemination and application.
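For readers who want to work with the corpus programmatically, the sketch below streams records from Hugging Face with the datasets library rather than downloading all 500 billion words at once. The repository id PleIAs/common_corpus and the train split are assumptions (the page above only gives huggingface.co as the publish URL); verify the exact id and schema on the dataset card before use.

```python
from datasets import load_dataset

# Stream the corpus instead of materializing it locally.
# NOTE: the repository id "PleIAs/common_corpus" and the "train" split
# are assumptions; confirm them on the Hugging Face dataset card.
ds = load_dataset("PleIAs/common_corpus", split="train", streaming=True)

# Peek at the first record to discover the schema before processing.
first = next(iter(ds))
print(sorted(first.keys()))
```

Streaming mode iterates over shards on demand, which is the practical way to sample or filter a corpus of this size on a single machine.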

Common-Corpus.torrent
Seeding 2 · Downloading 0 · Completed 88 · Total Downloads 238
  • Common-Corpus-zh/
    • README.md
      1.93 KB
    • README.txt
      3.86 KB
    • data/
      • Chinese-PD.zip
        225.16 MB
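Once the torrent finishes, the archive can be inspected without fully extracting it. A minimal sketch, assuming the files land in a local Common-Corpus-zh/ directory laid out as above:

```python
import zipfile

# List the first entries of the Chinese public domain archive.
# The path assumes the torrent was downloaded into the current directory.
with zipfile.ZipFile("Common-Corpus-zh/data/Chinese-PD.zip") as zf:
    for info in zf.infolist()[:10]:  # first ten entries only
        print(f"{info.filename}\t{info.file_size / 1e6:.2f} MB")
```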