Common Corpus
Date
Publish URL
License
Non-commercial use
Categories
Common Corpus is a large, open, permissively licensed text dataset of more than 2 trillion tokens, released by PleIAs in 2024. It consists of five subsets covering a variety of text types, including books, newspapers, scientific articles, government and legal documents, and code. The five subsets are:
- OpenCulture: Contains public domain books, newspapers, and Wikisource content.
- OpenGovernment: Contains financial and legal documents, such as those from the SEC and WTO.
- OpenSource: Contains high-quality code from GitHub.
- OpenScience: Contains academic content, such as material from OpenAlex and French papers.
- OpenWeb: Contains content from sites such as Wikipedia, YouTube Commons, and Stack Exchange.
Common Corpus data can be used for both commercial and non-commercial purposes, and it can be filtered by language and year. Although highly toxic content and personally identifiable information have been removed from the dataset, some bias and sensitive information may remain. The release is accompanied by a detailed technical report to support transparency and reproducibility. Common Corpus is supported by multiple organizations and communities, including the AI Alliance, Jean Zay, and the Nvidia Inception program.
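As a rough illustration of filtering by language and year, the sketch below applies both filters to a few in-memory records. The field names `language` and `year`, the helper `filter_records`, and the sample rows are all hypothetical; the actual column names and value formats should be checked against the published dataset before relying on this.

```python
from typing import Dict, Iterable, Iterator, Optional


def filter_records(
    records: Iterable[Dict],
    language: Optional[str] = None,
    min_year: Optional[int] = None,
    max_year: Optional[int] = None,
) -> Iterator[Dict]:
    """Yield records matching the given language and inclusive year range.

    Assumes each record is a dict with hypothetical "language" and "year"
    keys. Records with no year are dropped when a year bound is set.
    """
    for rec in records:
        if language is not None and rec.get("language") != language:
            continue
        year = rec.get("year")
        if min_year is not None and (year is None or year < min_year):
            continue
        if max_year is not None and (year is None or year > max_year):
            continue
        yield rec


# Made-up sample rows, for illustration only.
sample = [
    {"text": "Le Petit Journal ...", "language": "French", "year": 1885},
    {"text": "Annual report ...", "language": "English", "year": 2004},
    {"text": "Gazette ...", "language": "French", "year": 1923},
    {"text": "Undated pamphlet ...", "language": "French", "year": None},
]

french_19th = list(
    filter_records(sample, language="French", min_year=1800, max_year=1899)
)
```

Because the helper is a generator, it could in principle be applied to a streamed copy of the data without loading everything into memory, provided the records expose comparable fields.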