LCCC: Large-scale Cleaned Chinese Conversation Corpus
LCCC (Large-scale Cleaned Chinese Conversation corpus) was released by Tsinghua University and Samsung China Research Institute in 2020.
The dataset consists of two parts: LCCC-base (6.8 million dialogues) and LCCC-large (12 million dialogues). To ensure the quality of the dialogue data, the research team designed a strict filtering pipeline that combines a set of rules with a classifier trained on 110K manually annotated dialogue pairs. The pipeline removes several kinds of noise: dirty words, special characters, emoticons, grammatically incorrect sentences, and responses that are irrelevant to their dialogue context. The cleaned dataset and the accompanying pre-trained models are intended to support research on short-text dialogue modeling.
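To make the rule-based stage more concrete, the following is a minimal sketch of what such a filter might look like. It is not the LCCC team's implementation: the blocklist contents, the set of allowed characters, and the function names `is_clean` and `filter_dialogues` are all assumptions made for illustration, and the classifier stage (for grammaticality and context relevance) is not covered.

```python
import re

# Hypothetical blocklist; the word list actually used for LCCC is not given here.
BLOCKLIST = {"badword"}

# Assumed whitelist: CJK ideographs, ASCII letters and digits, whitespace,
# and a few common Chinese and ASCII punctuation marks. Anything else
# (emoji, emoticon symbols, other special characters) causes rejection.
ALLOWED = re.compile(r"[\u4e00-\u9fffA-Za-z0-9\s，。！？、,.!?]+")


def is_clean(utterance: str) -> bool:
    """Return True if a single utterance passes the illustrative rules."""
    text = utterance.strip()
    if not text:
        return False
    if any(word in text for word in BLOCKLIST):
        return False
    return ALLOWED.fullmatch(text) is not None


def filter_dialogues(dialogues):
    """Keep dialogues with at least two turns in which every turn is clean."""
    return [d for d in dialogues if len(d) >= 2 and all(is_clean(u) for u in d)]


if __name__ == "__main__":
    sample = [
        ["你好", "你好，很高兴认识你"],  # kept
        ["在吗 :-)", "在"],              # dropped: emoticon characters
        ["今天天气不错", "badword"],     # dropped: blocklisted word
        ["只有一句"],                    # dropped: not a dialogue
    ]
    for dialogue in filter_dialogues(sample):
        print(dialogue)
```

A whitelist of allowed characters is used here rather than an explicit list of emoticons because it rejects unknown symbols by default; a real pipeline would likely need finer-grained rules and the trained classifier on top.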