Common Corpus-zh Chinese Public Domain Dataset

Date: 2 years ago
Size: 225.16 MB
Organization: Hugging Face

Common Corpus was jointly created by Pleias, Hugging Face, and other organizations. It is the largest public domain dataset currently available and is designed specifically for training large language models (LLMs). The dataset contains 500 billion words from diverse cultural heritage projects around the world. It covers multiple languages, including English, French, Chinese, Spanish, German, and Italian, and is the most comprehensive language resource library to date.

Its English portion, at 180 billion words, is the largest English dataset to date. It includes 21 million documents from Chronicling America, a major American digital newspaper project, along with Nomic AI's original corpus map and monograph data collected by Sebastian Majstorovic. Common Corpus also contains the largest open datasets for French (110 billion words), German (30 billion words), Spanish, Dutch, and Italian, as well as some low-resource languages that are rarely included in large-scale language model training.

The launch of this dataset demonstrates that LLMs can be trained without relying on copyright-restricted content such as Common Crawl. It aims to build a strong platform for sharing AI data, simplify the research process, improve the reproducibility of research, promote the accessibility, diversity, and democratization of AI, and support knowledge dissemination and the application of large models.

Common-Corpus-zh.torrent
Seeding: 1 | Downloading: 0 | Completed: 124 | Total Downloads: 336
  • Common-Corpus-zh/
    • README.md (1.93 KB)
    • README.txt (3.86 KB)
    • data/
      • Chinese-PD.zip (225.16 MB)
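
The listing above does not describe what Chinese-PD.zip contains, so a reasonable first step after downloading is to inspect the archive before extracting it. The following is a minimal sketch using only Python's standard library. The local path mirrors the torrent layout and is an assumption, and `summarize_archive` is a hypothetical helper name, not part of any dataset tooling.

```python
import zipfile
from pathlib import Path


def summarize_archive(zip_path):
    """Return (member_name, uncompressed_bytes) pairs for each file in the archive."""
    with zipfile.ZipFile(zip_path) as archive:
        return [
            (info.filename, info.file_size)
            for info in archive.infolist()
            if not info.is_dir()  # skip directory entries
        ]


if __name__ == "__main__":
    # Assumed download location, mirroring the torrent layout; adjust as needed.
    archive_path = Path("Common-Corpus-zh/data/Chinese-PD.zip")
    if archive_path.exists():
        for name, size in summarize_archive(archive_path):
            print(f"{size:>12,d}  {name}")
    else:
        print(f"{archive_path} not found; download the torrent first.")
```

Listing the members first shows the file formats and sizes involved, which helps in choosing a suitable reader before unpacking anything to disk.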
