Date: 2 years ago

Size: 939.48 MB

Paper URL: arxiv.org

Tags: Natural Language Processing

LCCC (Large-scale Cleaned Chinese Conversation corpus) was released in 2020 by Tsinghua University and Samsung China Research Institute. It comes in two versions: LCCC-base (6.8 million dialogues) and LCCC-large (12 million dialogues). To ensure dialogue quality, the research team designed a strict filtering pipeline that combines a set of rules with a classifier trained on 110K manually annotated dialogue pairs. The pipeline removes dirty words, special characters, emoticons, grammatically incorrect sentences, and responses that are irrelevant to their context. The cleaned dataset and the accompanying pre-trained model are intended to support research on short-text dialogue modeling.
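To make the rule-based half of such a filtering process concrete, here is a minimal Python sketch. It is not the LCCC team's actual pipeline: the blocklist, the allowed-character pattern, the emoticon pattern, and the function names `passes_rules` and `filter_dialogues` are all assumptions made for illustration. The trained classifier, which would handle grammar and context relevance, is not reproduced here.

```python
import re

# Hypothetical placeholder blocklist; the real word list is not part of this description.
BLOCKLIST = {"badword"}

# Assumed whitelist: CJK ideographs, ASCII letters and digits, whitespace,
# and a handful of common punctuation marks.
ALLOWED = re.compile(r"^[\u4e00-\u9fffA-Za-z0-9\s,.!?，。！？、~]+$")

# Assumed emoticon pattern: common emoji and symbol blocks, plus short
# bracketed emoticon codes such as "[doge]".
EMOJI = re.compile(r"[\U0001F300-\U0001FAFF\u2600-\u27BF]|\[[^\[\]]{1,8}\]")


def passes_rules(utterance: str) -> bool:
    """Return True if a single utterance survives every rule."""
    text = utterance.strip()
    if not text:
        return False
    if EMOJI.search(text):
        return False
    if not ALLOWED.match(text):
        return False
    lowered = text.lower()
    return not any(word in lowered for word in BLOCKLIST)


def filter_dialogues(dialogues):
    """Keep only dialogues in which every utterance passes the rules."""
    return [d for d in dialogues if d and all(passes_rules(u) for u in d)]


if __name__ == "__main__":
    sample = [
        ["你好", "你好呀！"],
        ["今天天气不错", "是啊😀"],
        ["hello", "you badword"],
    ]
    for dialogue in filter_dialogues(sample):
        print(dialogue)
```

Dropping a whole dialogue when any single utterance fails is one possible design choice; a pipeline could instead truncate the dialogue at the first offending turn.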

LCCC.torrent

Seeding: 2, Downloading: 0, Completed: 325, Total Downloads: 578

LCCC/
- README.md
  1.38 KB
- README.txt
  2.76 KB

This dataset is contributed by community users and is intended for educational and informational purposes only. If any content involves copyright infringement, please contact us at [email protected] for prompt review and removal.

