Common Corpus: A Large-Scale Open Text Dataset
Common Corpus is a large-scale open text dataset, described in the paper "Common Corpus: The Largest Collection of Ethical Data for LLM Pre-Training". The dataset contains only copyright-free or permissively licensed data, avoiding intellectual-property risks, and is currently the largest open-licensed text dataset.
The dataset contains 2 trillion tokens, covering books, scientific literature, code, legal documents, and other domains. The main languages are English and French; eight further languages (German, Spanish, Italian, etc.) each exceed 10 billion tokens, and 33 languages exceed 1 billion tokens.
Core subsets of the dataset:
- OpenCulture: Public-domain books and newspapers (e.g. Wikisource, Project Gutenberg), including OCR-corrected historical documents.
- OpenGovernment: Legal and administrative documents (e.g. SEC reports, WTO filings, European Parliament data).
- OpenSource: High-quality GitHub code, filtered to the top 80% of submissions as scored by the ArmoRM tool.
- OpenScience: Academic resources such as OpenAlex, retaining structured content such as formulas and charts.
- OpenWeb: Web texts such as Wikipedia, YouTube Commons, and Stack Exchange.
- OpenSemantic: Natural-language verbalizations of Wikidata semantic triples, covering 300+ languages.
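Since the corpus is organized into named subsets like the ones above, a common first step is to select only the records from the subset you need. The sketch below is a minimal, self-contained illustration of such filtering over mock records; the record schema (fields `text` and `collection`) is an assumption, not the dataset's documented schema, and the commented-out loading call shows how the real data could be streamed with the Hugging Face `datasets` library.

```python
# Hypothetical sketch: selecting Common Corpus records by subset name.
# In practice the dataset can be streamed from the Hugging Face Hub, e.g.:
#   from datasets import load_dataset
#   ds = load_dataset("PleIAs/common_corpus", streaming=True, split="train")
# The field names used below ("text", "collection") are assumptions.

def filter_by_collection(records, collection):
    """Return only the records that belong to the given subset."""
    return [r for r in records if r.get("collection") == collection]

# Mock records standing in for streamed corpus entries.
sample = [
    {"text": "An 1850 newspaper article ...", "collection": "OpenCulture"},
    {"text": "def parse(line): ...",          "collection": "OpenSource"},
    {"text": "SEC annual report excerpt ...", "collection": "OpenGovernment"},
]

code_records = filter_by_collection(sample, "OpenSource")
print(len(code_records))  # 1
```

Streaming mode is worth considering here: at 2 trillion tokens the corpus is far too large to download eagerly, and a streaming iterator lets you filter subsets on the fly.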