MCTS Chinese Text Simplification Dataset
MCTS stands for Multi-Reference Chinese Text Simplification Dataset. It is a Chinese text simplification dataset released in 2024 by a research team from Beijing Language and Culture University, Northeastern University, and Tsinghua University. Introduced in the paper "MCTS: A Multi-Reference Chinese Text Simplification Dataset", it aims to provide rich resources and support for text simplification tasks in natural language processing.
The dataset contains 723 structurally complex sentences selected from news corpora based on the Penn Chinese Treebank (CTB) standard. Each sentence is paired with multiple manually simplified versions, which makes MCTS the largest evaluation dataset for Chinese text simplification and the one with the most references. MCTS also defines three types of sentence rewriting: paraphrase, sentence compression, and structural transformation. Together they cover different text simplification strategies.
The MCTS dataset is suitable for research areas such as graded reading and machine translation, and it can also help language learners better understand and process complex texts.
In terms of usage, MCTS provides parallel data that can be used to train and optimize Chinese text simplification models. Researchers can also quantify a system's performance by comparing its simplified output against the multiple reference simplifications in the dataset, using automatic evaluation metrics such as SARI, BLEU, and HSK Level.
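To illustrate how multiple references enter an automatic metric, here is a minimal sketch of the clipped n-gram precision at the core of BLEU, computed at the character level. It is not the full BLEU or SARI formula and not the dataset's official evaluation script. The function names and the example sentences are hypothetical and are not taken from MCTS.

```python
from collections import Counter


def char_ngrams(text, n):
    """Count the character n-grams in a string."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def clipped_precision(candidate, references, n):
    """Modified n-gram precision of a candidate against several references.

    Each candidate n-gram is credited at most as many times as it appears
    in the single reference where it is most frequent.
    """
    cand_counts = char_ngrams(candidate, n)
    total = sum(cand_counts.values())
    if total == 0:
        return 0.0

    # Highest count of each n-gram across the individual references.
    max_ref = Counter()
    for ref in references:
        for gram, count in char_ngrams(ref, n).items():
            max_ref[gram] = max(max_ref[gram], count)

    matched = sum(min(count, max_ref[gram]) for gram, count in cand_counts.items())
    return matched / total


# Hypothetical system output and two hypothetical reference simplifications.
system_output = "他很高兴"
reference_simplifications = ["他很开心", "他非常高兴"]

unigram_p = clipped_precision(system_output, reference_simplifications, 1)
bigram_p = clipped_precision(system_output, reference_simplifications, 2)
print(unigram_p, bigram_p)
```

With more references, a valid simplification is more likely to find matching n-grams in at least one of them, which is why a multi-reference dataset tends to give more reliable automatic scores than a single-reference one.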