Common Corpus: A Large-Scale Open Text Dataset
Common Corpus is a large-scale open text dataset, described in the paper "Common Corpus: The Largest Collection of Ethical Data for LLM Pre-Training". The dataset contains only copyright-free or permissively licensed data, avoiding intellectual-property risks, and is currently the largest open-licensed text dataset.
The dataset contains 2 trillion tokens, covering books, scientific literature, code, legal documents, and other domains. The main languages are English and French; eight further languages (German, Spanish, Italian, etc.) each exceed 10 billion tokens, and 33 languages exceed 1 billion tokens.
Core subsets of the dataset:
- OpenCulture: Public-domain books and newspapers (e.g. Wikisource, Project Gutenberg), including OCR-corrected historical documents.
- OpenGovernment: Legal and administrative documents (e.g. SEC reports, WTO filings, European Parliament data).
- OpenSource: High-quality GitHub code, filtered to the top 80% of submissions as scored by the ArmoRM tool.
- OpenScience: Academic resources such as OpenAlex, retaining structured content such as formulas and charts.
- OpenWeb: Web texts such as Wikipedia, YouTube Commons, and Stack Exchange.
- OpenSemantic: Natural-language verbalizations of Wikidata semantic triples, covering 300+ languages.
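Since the corpus is organized into named subsets like the ones above, a common first step is to select only the records from the subset you need. The sketch below is a minimal, self-contained illustration of such filtering over mock records; the record schema (fields `text` and `collection`) is an assumption, not the dataset's documented schema, and the commented-out loading call shows how the real data could be streamed with the Hugging Face `datasets` library.

```python
# Hypothetical sketch: selecting Common Corpus records by subset name.
# In practice the dataset can be streamed from the Hugging Face Hub, e.g.:
#   from datasets import load_dataset
#   ds = load_dataset("PleIAs/common_corpus", streaming=True, split="train")
# The field names used below ("text", "collection") are assumptions.

def filter_by_collection(records, collection):
    """Return only the records that belong to the given subset."""
    return [r for r in records if r.get("collection") == collection]

# Mock records standing in for streamed corpus entries.
sample = [
    {"text": "An 1850 newspaper article ...", "collection": "OpenCulture"},
    {"text": "def parse(line): ...",          "collection": "OpenSource"},
    {"text": "SEC annual report excerpt ...", "collection": "OpenGovernment"},
]

code_records = filter_by_collection(sample, "OpenSource")
print(len(code_records))  # 1
```

Streaming mode is worth considering here: at 2 trillion tokens the corpus is far too large to download eagerly, and a streaming iterator lets you filter subsets on the fly.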