HyperAI

Common Corpus Large-scale Open Text Dataset

Date

7 days ago

Publish URL

huggingface.co


Common Corpus is a large-scale open text dataset, presented in the paper "Common Corpus: The Largest Collection of Ethical Data for LLM Pre-Training". The dataset contains only copyright-free or permissively licensed data in order to avoid intellectual-property risks, and it is currently the largest openly licensed text dataset.

The dataset contains 2 trillion tokens covering books, scientific literature, code, legal documents, and other domains. English and French are the main languages; eight further languages (German, Spanish, Italian, etc.) each exceed 10 billion tokens, and 33 languages exceed 1 billion tokens.

Core subsets of the dataset:

  • OpenCulture: Public-domain books and newspapers (e.g., Wikisource, Project Gutenberg), including OCR-corrected historical documents.
  • OpenGovernment: Legal and administrative documents (e.g., SEC reports, WTO filings, European Parliament data).
  • OpenSource: High-quality GitHub code, keeping the top 80% of submissions as scored by the ArmoRM tool.
  • OpenScience: Academic resources such as OpenAlex, retaining structured content such as formulas and charts.
  • OpenWeb: Web text such as Wikipedia, YouTube Commons, and Stack Exchange.
  • OpenSemantic: Natural-language transcriptions of semantic triples from Wikidata, supporting 300+ languages.
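As a minimal sketch of how a per-subset token tally like the one above could be computed, assuming each record carries a `collection` field naming its subset and a `token_count` field (hypothetical field names; the actual schema of the Hugging Face release may differ):

```python
from collections import defaultdict

# Hypothetical sample records; real Common Corpus rows may use other field names.
records = [
    {"collection": "OpenCulture", "token_count": 1200},
    {"collection": "OpenSource", "token_count": 800},
    {"collection": "OpenCulture", "token_count": 300},
]

def tokens_per_collection(rows):
    """Sum token counts grouped by subset (collection) name."""
    totals = defaultdict(int)
    for row in rows:
        totals[row["collection"]] += row["token_count"]
    return dict(totals)

print(tokens_per_collection(records))
# {'OpenCulture': 1500, 'OpenSource': 800}
```

The same grouping logic would apply unchanged when iterating over a streamed copy of the dataset instead of an in-memory list.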