
Common Corpus Large-scale Open Text Dataset

Date: 5 months ago

Paper URL: arxiv.org


Common Corpus is a large-scale open text dataset introduced in the paper "Common Corpus: The Largest Collection of Ethical Data for LLM Pre-Training". The dataset contains only copyright-free or permissively licensed data, avoiding intellectual-property risk, and is currently the largest openly licensed text dataset.

The dataset contains 2 trillion tokens covering books, scientific literature, code, legal documents, and other fields. English and French are the main languages; it also includes 8 languages with over 10 billion tokens each (German, Spanish, Italian, etc.) and 33 languages with over 1 billion tokens each.

Core subsets of the dataset:

  • OpenCulture: Public-domain books and newspapers (e.g. Wikisource, Project Gutenberg), including OCR-corrected historical documents.
  • OpenGovernment: Legal and administrative documents (e.g. SEC reports, WTO filings, European Parliament data).
  • OpenSource: High-quality GitHub code, keeping the top 80% of submissions as screened by the ArmoRM tool.
  • OpenScience: Academic resources such as OpenAlex, retaining structured information such as formulas and charts.
  • OpenWeb: Web texts such as Wikipedia, YouTube Commons, and Stack Exchange.
  • OpenSemantic: Natural-language transcriptions of semantic triples from Wikidata, supporting 300+ languages.
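Common Corpus is distributed via Hugging Face, so a natural way to explore it is to stream records rather than download all 2 trillion tokens. The sketch below is a minimal example under stated assumptions: the dataset id `PleIAs/common_corpus` and a per-record `collection` field naming the subset are assumptions here, not guaranteed by this article, and the exact schema should be checked against the dataset card.

```python
# Subset names as listed above; whether records carry a "collection"
# field with these values is an assumption about the dataset schema.
CORE_COLLECTIONS = {
    "OpenCulture", "OpenGovernment", "OpenSource",
    "OpenScience", "OpenWeb", "OpenSemantic",
}

def in_core(record: dict) -> bool:
    """Return True if a record belongs to one of the core subsets."""
    return record.get("collection") in CORE_COLLECTIONS

def preview(n: int = 3) -> None:
    """Stream the first n matching records without a full download.

    Requires `pip install datasets`; dataset id is an assumption.
    """
    from datasets import load_dataset  # imported lazily: optional dependency

    ds = load_dataset("PleIAs/common_corpus", split="train", streaming=True)
    shown = 0
    for record in ds:
        if not in_core(record):
            continue
        print(record.get("text", "")[:200])
        shown += 1
        if shown >= n:
            break

if __name__ == "__main__":
    preview()
```

Streaming mode (`streaming=True`) iterates over shards lazily, which is the practical way to sample or filter a corpus of this size on a single machine.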
