CCI Chinese Internet Corpus
With the rapid development of large language models (LLMs), demand for high-quality datasets in industry and academia continues to grow. These datasets must not only contain massive amounts of information but also be strictly screened and cleaned to ensure their accuracy and the security of downstream models and applications. However, the public datasets currently popular in the industry carry certain quality and security risks, and high-quality datasets are particularly scarce for Chinese. Building a secure Chinese dataset also poses many challenges. A dataset that has been strictly screened and standardized is therefore particularly important for the innovation and development of LLMs.
Chinese Corpora Internet (CCI) is composed of data from high-quality, trusted sources on the mainland China Internet. CCI has undergone strict data cleaning and deduplication, with targeted testing and filtering for content quality. The data processing rules include:
- Rule-based filtering: density-based extraction, keyword filtering, spam filtering, conversion between simplified and traditional Chinese, etc.
- Model-based filtering: filtering out low-quality content with trained classification models.
- Deduplication: removing duplicate data within and between datasets.
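As a rough illustration of how rule-based filtering and deduplication can fit together, here is a minimal sketch. It is not the CCI pipeline: the keyword blocklist, the `MIN_CHARS` threshold, and all function names are hypothetical, and exact-match hashing stands in for whatever deduplication method the team actually used.

```python
import hashlib
import unicodedata

# Hypothetical blocklist and length threshold, chosen only for illustration.
BLOCKED_KEYWORDS = {"点击购买", "免费领取"}
MIN_CHARS = 10


def normalize(text: str) -> str:
    """NFKC-normalize and collapse whitespace runs into single spaces."""
    return " ".join(unicodedata.normalize("NFKC", text).split())


def passes_rules(text: str) -> bool:
    """Rule-based filter: drop very short documents and blocked keywords."""
    if len(text) < MIN_CHARS:
        return False
    return not any(keyword in text for keyword in BLOCKED_KEYWORDS)


def filter_and_deduplicate(documents, seen_hashes=None):
    """Keep documents that pass the rules and are not exact duplicates.

    Passing the same `seen_hashes` set to several calls removes duplicates
    between datasets as well as within one dataset.
    """
    if seen_hashes is None:
        seen_hashes = set()
    kept = []
    for doc in documents:
        text = normalize(doc)
        if not passes_rules(text):
            continue
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if digest in seen_hashes:
            continue  # an identical normalized document was already kept
        seen_hashes.add(digest)
        kept.append(text)
    return kept
```

Sharing one `seen_hashes` set across calls is what extends deduplication from a single dataset to several. Near-duplicate detection (for example MinHash) would need a different technique than the exact hashing assumed here.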
In addition, because large-scale pre-training data can easily leak evaluation data, the research team strictly screened and filtered the corpus against several mainstream Chinese evaluation datasets during the data processing stage.
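The text does not say how this screening was done. One common approach is to drop any document that shares a long character n-gram with a benchmark sample; the sketch below assumes that approach. The function names and the window size `n=13` are illustrative assumptions, not details of the CCI pipeline.

```python
def char_ngrams(text: str, n: int) -> set:
    """Return the set of character n-grams in `text`, ignoring whitespace."""
    compact = "".join(text.split())
    return {compact[i:i + n] for i in range(len(compact) - n + 1)}


def build_benchmark_index(benchmark_samples, n=13):
    """Collect every character n-gram that appears in any benchmark sample."""
    index = set()
    for sample in benchmark_samples:
        index |= char_ngrams(sample, n)
    return index


def is_contaminated(document: str, benchmark_index: set, n=13) -> bool:
    """True if the document shares at least one n-gram with the benchmarks."""
    return not char_ngrams(document, n).isdisjoint(benchmark_index)


def decontaminate(documents, benchmark_samples, n=13):
    """Return only the documents with no n-gram overlap with the benchmarks."""
    index = build_benchmark_index(benchmark_samples, n)
    return [doc for doc in documents if not is_contaminated(doc, index, n)]
```

Character n-grams are used instead of word n-grams because Chinese text has no whitespace word boundaries. A smaller `n` catches more paraphrased leaks but also removes more innocent text.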
The released CCI corpus (CCI v1.0.0) is 104 GB in size. The dataset covers the period from January 2001 to November 2023.