HyperAI

LCCC: Large-scale Cleaned Chinese Conversation Corpus

Date: a year ago

Size: 939.48 MB

Organization: Tsinghua University

Publish URL: github.com

LCCC (Large-scale Cleaned Chinese Conversation corpus) was released by Tsinghua University and Samsung China Research Institute in 2020.

The dataset consists of two parts: LCCC-base (6.8 million dialogues) and LCCC-large (12 million dialogues). To ensure the quality of the dialogue data, the research team designed a strict filtering pipeline that combines a set of rules with a classifier trained on 110K manually annotated dialogue pairs. The noise it removes includes profanity, special characters, emoticons, ungrammatical sentences, and responses that are irrelevant to their context. The cleaned dataset and the pre-trained model are intended to support research on short-text dialogue modeling.
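As a rough illustration of the rule-based part of such a filter, the sketch below drops dialogues that contain blocklisted words, emoji, or long runs of repeated symbols. It is not the research team's pipeline: the blocklist, the regular expressions, and the function names `is_clean_utterance` and `filter_dialogues` are all assumptions made for this example, and the classifier stage is not shown. Each dialogue is assumed to be a list of utterance strings.

```python
import re

# Hypothetical blocklist; the word list actually used for LCCC is not given here.
BLOCKED_WORDS = {"badword"}

# Rough emoji/pictograph ranges (an approximation, not exhaustive).
EMOJI_RE = re.compile("[\U0001F300-\U0001FAFF\u2600-\u27BF]")

# Three or more identical punctuation/symbol characters in a row, e.g. "!!!!".
SYMBOL_RUN_RE = re.compile(r"([^\w\s])\1{2,}")


def is_clean_utterance(text: str) -> bool:
    """Return True if the utterance passes every illustrative rule."""
    if not text.strip():
        return False  # empty or whitespace-only
    if EMOJI_RE.search(text):
        return False  # contains an emoji or pictograph
    if SYMBOL_RUN_RE.search(text):
        return False  # contains a long run of one repeated symbol
    if any(word in text for word in BLOCKED_WORDS):
        return False  # contains a blocklisted word
    return True


def filter_dialogues(dialogues):
    """Keep dialogues with at least two turns in which every turn is clean."""
    return [
        d for d in dialogues
        if len(d) >= 2 and all(is_clean_utterance(u) for u in d)
    ]


if __name__ == "__main__":
    sample = [
        ["你好", "你好呀"],
        ["在吗!!!!", "在"],
        ["今天天气不错 😀", "是的"],
        ["只有一句"],
    ]
    print(filter_dialogues(sample))
```

Rules like these only catch surface-level noise. Judging whether a response is relevant to its context or grammatically well formed is the kind of decision the source attributes to the trained classifier.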

LCCC.torrent
Seeding: 1 | Downloading: 1 | Completed: 129 | Total downloads: 305
  • LCCC/
    • README.md (1.38 KB)
    • README.txt (2.76 KB)
    • data/
      • lccc.zip (939.48 MB)