ChineseWebText: Chinese Web Text Dataset
ChineseWebText is a large Chinese web text dataset, described by its authors as the latest and largest of its kind, containing 1.42 TB of data. Each text is assigned a quality score, so large language model researchers can select data using their own quality thresholds. A cleaner subset of 600 GB of Chinese text with quality scores above 90% is also released here. This directory contains the ChineseWebText dataset and the EvalWeb toolchain for processing CommonCrawl data.
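Selecting data by quality threshold might look like the sketch below. It is illustrative only: it assumes the data is stored as JSON Lines, that each record carries its quality score in a field named `score` on a 0 to 1 scale (so 90% corresponds to 0.9), and the helper name `filter_by_quality` is hypothetical. Check the actual file layout and field names in the release before relying on it.

```python
import json
from typing import Iterable, Iterator


def filter_by_quality(
    lines: Iterable[str],
    threshold: float = 0.9,
    score_key: str = "score",  # assumed field name, not confirmed by this page
) -> Iterator[dict]:
    """Yield JSON records whose quality score is strictly above `threshold`."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        score = record.get(score_key)
        # Records without a numeric score are dropped rather than guessed at.
        if isinstance(score, (int, float)) and score > threshold:
            yield record


# Tiny in-memory sample standing in for lines read from a dataset file.
sample = [
    '{"text": "示例文本一", "score": 0.95}',
    '{"text": "示例文本二", "score": 0.42}',
    '{"text": "示例文本三"}',
]
kept = list(filter_by_quality(sample, threshold=0.9))
print(len(kept))
```

For a real file, the same function can be fed an open file handle, for example `filter_by_quality(open(path, encoding="utf-8"), threshold=0.8)`, which streams records one line at a time instead of loading the whole file into memory.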
ChineseWebText.torrent
Seeding: 1 | Downloading: 0 | Completed: 167 | Total downloads: 378