HyperAI

ChineseWebText: Chinese Web Text Dataset

Date: a year ago

Size: 398.86 GB

Publish URL: huggingface.co

ChineseWebText was released as the latest and largest Chinese web text dataset, containing 1.42 TB of data. Each text is assigned a quality score, so large language model researchers can select data using a quality threshold of their own choosing. A cleaner subset of 600 GB of Chinese text with quality scores above 90% is also available. This directory contains the ChineseWebText dataset and the EvalWeb toolchain for processing CommonCrawl data.
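As an illustration of threshold-based selection, the following is a minimal sketch only. It assumes the data is stored as JSON Lines, that each record carries a numeric `score` field in the range 0 to 1 (so "90%" corresponds to 0.9), and a `text` field. These field names, the helper `filter_by_quality`, and the sample records are assumptions made for the example and should be checked against the README shipped with the dataset.

```python
import io
import json
from typing import Iterable, Iterator


def filter_by_quality(lines: Iterable[str], threshold: float = 0.9,
                      score_key: str = "score") -> Iterator[dict]:
    """Yield records whose quality score is strictly above the threshold.

    `score_key` is an assumed field name; adjust it to the real schema.
    """
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        score = record.get(score_key)
        # Records without a numeric score are dropped rather than guessed at.
        if isinstance(score, (int, float)) and score > threshold:
            yield record


# Tiny in-memory sample standing in for one JSON Lines shard (made-up data).
sample = io.StringIO(
    '{"text": "示例文本一", "score": 0.95}\n'
    '{"text": "示例文本二", "score": 0.42}\n'
    '{"text": "示例文本三", "score": 0.91}\n'
)

kept = list(filter_by_quality(sample, threshold=0.9))
print(len(kept))
```

Because the function consumes any iterable of lines, the same code could be pointed at an open file handle and would stream through a large shard without loading it into memory.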

ChineseWebText.torrent
Seeding: 2 | Downloading: 1 | Completed: 103 | Total Downloads: 279
  • ChineseWebText/
    • README.md (1.16 KB)
    • README.txt (2.32 KB)
    • data/
      • C-webtexet.zip (398.86 GB)