Date

2 years ago

As large language models develop rapidly, demand for high-quality datasets keeps growing in both industry and academia. Such datasets must not only contain massive amounts of information but also be strictly screened and cleaned, to ensure their accuracy and the safety of downstream models and applications. However, the public datasets currently popular in the industry carry certain quality and safety risks, and high-quality datasets are particularly scarce for Chinese. Building a safe Chinese dataset also poses many challenges. A strictly screened and standardized dataset is therefore particularly important for the innovation and development of LLMs.

**Chinese Corpora Internet (CCI)** is composed of high-quality data from trusted Internet sources in mainland China. The corpus has undergone strict cleaning and deduplication, along with targeted detection and filtering for content quality. The data processing rules include:

Rule-based filtering: density-based extraction, keyword filtering, spam filtering, conversion between simplified and traditional Chinese, etc.
Model-based filtering: low-quality content is filtered out by trained classification models.
Deduplication: data is deduplicated both within and across datasets.

In addition, because the sheer scale of pre-training data can easily lead to evaluation data leakage, the research team strictly screened for and filtered out data from several mainstream Chinese evaluation datasets during the data processing stage.

The released CCI corpus (CCI v1.0.0) is 104 GB in size. The dataset as a whole spans January 2001 to November 2023.
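The processing rules above can be pictured with a small Python sketch. It is not the CCI team's pipeline, which is not described in detail here; it only illustrates three of the listed ideas under simple assumptions: keyword filtering against a blocklist, exact deduplication by hashing normalized text, and removal of documents that share a character n-gram with an evaluation sample. The blocklist contents, the n-gram length of 8, and all function names are hypothetical choices made for illustration.

```python
import hashlib
import re
import unicodedata

# Hypothetical blocklist; the keyword lists actually used for CCI are not given.
BLOCKED_KEYWORDS = {"点击购买", "广告"}


def normalize(text):
    """NFKC-normalize and collapse whitespace so trivially different copies match."""
    text = unicodedata.normalize("NFKC", text)
    return re.sub(r"\s+", " ", text).strip()


def passes_keyword_filter(text):
    """Return True if the text contains none of the blocked keywords."""
    return not any(keyword in text for keyword in BLOCKED_KEYWORDS)


def char_ngrams(text, n):
    """Return the set of all character n-grams of the text."""
    return {text[i:i + n] for i in range(len(text) - n + 1)}


def clean_corpus(docs, eval_samples, n=8):
    """Keep documents that pass the keyword filter, are not exact duplicates
    (after normalization), and share no character n-gram with eval_samples."""
    # Collect every n-gram that occurs in the evaluation data.
    eval_ngrams = set()
    for sample in eval_samples:
        eval_ngrams |= char_ngrams(normalize(sample), n)

    seen_digests = set()
    kept = []
    for doc in docs:
        norm = normalize(doc)
        if not norm or not passes_keyword_filter(norm):
            continue  # rule-based filtering
        digest = hashlib.sha256(norm.encode("utf-8")).hexdigest()
        if digest in seen_digests:
            continue  # exact duplicate of an earlier document
        if not char_ngrams(norm, n).isdisjoint(eval_ngrams):
            continue  # overlaps with evaluation data, possible leakage
        seen_digests.add(digest)
        kept.append(doc)
    return kept
```

Calling `clean_corpus(docs, eval_samples)` returns the surviving documents in their original order. A production pipeline would more likely use near-duplicate detection (for example MinHash) rather than exact hashing, and a trained classifier for the model-based filtering step; both are left out here to keep the sketch short.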

This dataset is contributed by community users and is intended for educational and informational purposes only. If any content involves copyright infringement, please contact us at [email protected] for prompt review and removal.
