Common Corpus

Date

a year ago

Organization

License

Non-Commercial

Common Corpus is a large, open, permissively licensed text dataset of more than 2 trillion tokens, released by PleIAs in 2024. It consists of 5 subsets covering a variety of text types, including books, newspapers, scientific articles, government and legal documents, and code. The 5 subsets are:

  • OpenCulture: Contains public domain books, newspapers, and Wikisource content.
  • OpenGovernment: Contains financial and legal documents, such as those from the SEC and WTO.
  • OpenSource: Contains high-quality code from GitHub.
  • OpenScience: Contains academic content, such as OpenAlex records and French papers.
  • OpenWeb: Contains content from sites such as Wikipedia, YouTube Commons, and Stack Exchange.

Common Corpus data can be used for commercial and non-commercial purposes, and it can be filtered by language and year. Although highly toxic content and personally identifiable information have been removed from the dataset, some bias and sensitive information may remain. The release is accompanied by a detailed technical report to support transparency and reproducibility. Common Corpus is supported by multiple organizations and communities, including the AI Alliance, Jean Zay, and the NVIDIA Inception program.
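
As a rough illustration of the language and year filtering mentioned above, the following Python sketch filters a few in-memory records. The field names `language`, `year`, and `text`, the sample rows, and the helper `filter_records` are all assumptions made for this example; they are not taken from the dataset's documented schema, so check the actual column names before adapting it.

```python
from typing import Dict, Iterable, Iterator, Set

# Hypothetical record layout: "language", "year" and "text" are assumed
# field names, not the dataset's documented schema.
Record = Dict[str, object]


def filter_records(
    records: Iterable[Record],
    languages: Set[str],
    min_year: int,
    max_year: int,
) -> Iterator[Record]:
    """Yield records in one of `languages` dated within [min_year, max_year]."""
    for record in records:
        if record.get("language") not in languages:
            continue
        year = record.get("year")
        # Skip records with a missing or non-integer year.
        if not isinstance(year, int):
            continue
        if min_year <= year <= max_year:
            yield record


# Made-up sample rows, for illustration only.
sample = [
    {"language": "French", "year": 1885, "text": "..."},
    {"language": "English", "year": 1885, "text": "..."},
    {"language": "French", "year": 1950, "text": "..."},
    {"language": "French", "year": None, "text": "..."},
]

selected = list(filter_records(sample, {"French"}, 1800, 1900))
```

Because the helper is a generator, it can be applied to a streamed iterator of records without loading the full corpus into memory.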