ChineseWebText: Chinese Web Text Dataset
Date
a year ago
Size
398.86 GB
ChineseWebText is the latest and largest Chinese dataset, containing 1.42 TB of data. Each text is assigned a quality score, so large language model researchers can select data according to their chosen quality thresholds. A cleaner subset containing 600 GB of Chinese text with quality scores above 90% is also released here. This directory contains the ChineseWebText dataset and the EvalWeb toolchain for processing CommonCrawl data.
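Selecting data by quality threshold can be sketched as follows. This is a minimal illustration, not the EvalWeb toolchain itself: it assumes the data is stored as JSON Lines, one record per line, with the text under a `text` field and the quality score under a `score` field in the range 0 to 1. The field names, the file layout, and the helper `filter_by_quality` are assumptions made for this example, so check them against the actual files before use.

```python
import json
from typing import Iterable, Iterator


def filter_by_quality(
    lines: Iterable[str],
    threshold: float = 0.9,
    score_key: str = "score",  # assumed field name; verify against the real data
) -> Iterator[dict]:
    """Yield records whose quality score is strictly above `threshold`.

    `lines` is any iterable of JSON Lines strings, for example an open file.
    Records without a score are treated as score 0.0 and therefore dropped.
    """
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        if record.get(score_key, 0.0) > threshold:
            yield record


if __name__ == "__main__":
    # Tiny in-memory sample standing in for a real data file.
    sample = [
        '{"text": "high quality passage", "score": 0.95}',
        '{"text": "low quality passage", "score": 0.42}',
    ]
    for rec in filter_by_quality(sample, threshold=0.9):
        print(rec["text"])
```

Because the function takes any iterable of lines, an open file handle can be passed directly, so a large file is streamed rather than loaded into memory at once.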
ChineseWebText.torrent
Seeding: 2, Downloading: 1, Completed: 103, Total Downloads: 279