DuIE Large-Scale Chinese Information Extraction Dataset
License: Non-Commercial
DuIE is a large-scale, manually annotated dataset for evaluating schema-based knowledge extraction algorithms.
The dataset contains more than 210,000 real-world Chinese sentences and more than 450,000 SPO (Subject-Predicate-Object) triples, annotated against a pre-specified schema of 49 predicates.
All sentences in this dataset are extracted from Baidu Baike and Baidu News Search. The texts in this dataset cover various fields in real-world applications, such as news, entertainment, and user-generated content.
The dataset is split as follows:
- 214,590 sentences, of which:
  - 172,983 sentences are in the training set;
  - 21,626 sentences are in the development set;
  - 19,981 sentences are in the test set.
- 457,866 instances, of which:
  - 363,960 instances are in the training set;
  - 45,558 instances are in the development set;
  - 48,348 instances are in the test set.
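As a quick sanity check on the figures above, the following sketch sums the per-split counts and derives each split's share of the total. The variable and function names are illustrative and not part of the dataset distribution.

```python
# Split sizes copied from the list above.
sentences = {"train": 172_983, "dev": 21_626, "test": 19_981}
instances = {"train": 363_960, "dev": 45_558, "test": 48_348}


def summarize(counts):
    """Return the total count and each split's share of it, in percent."""
    total = sum(counts.values())
    shares = {name: 100 * n / total for name, n in counts.items()}
    return total, shares


sentence_total, sentence_shares = summarize(sentences)
instance_total, instance_shares = summarize(instances)

# Average number of annotated instances per sentence across all splits.
instances_per_sentence = instance_total / sentence_total

print(sentence_total, instance_total)
for name in sentences:
    print(f"{name}: {sentence_shares[name]:.1f}% of sentences, "
          f"{instance_shares[name]:.1f}% of instances")
print(f"instances per sentence: {instances_per_sentence:.2f}")
```

The totals match the stated 214,590 sentences and 457,866 instances, which corresponds to roughly an 80/10/10 split with a little over two instances per sentence on average.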
Example data:
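A minimal sketch of how one annotated record could be represented and read, assuming a JSON-lines layout in which each line holds a sentence and its list of SPO triples. The field names (`text`, `spo_list`, `subject`, `predicate`, `object`), the sample sentence, and the `parse_record` helper are illustrative assumptions, not taken from this page; check them against the released files before relying on them.

```python
import json

# Hypothetical record: field names and content are illustrative assumptions.
line = json.dumps(
    {
        "text": "《三体》是刘慈欣创作的长篇科幻小说",
        "spo_list": [
            {"subject": "三体", "predicate": "作者", "object": "刘慈欣"},
        ],
    },
    ensure_ascii=False,
)


def parse_record(raw):
    """Parse one JSON line into the sentence text and a list of (S, P, O) tuples."""
    record = json.loads(raw)
    triples = [
        (spo["subject"], spo["predicate"], spo["object"])
        for spo in record["spo_list"]
    ]
    return record["text"], triples


text, triples = parse_record(line)
print(text)
for subject, predicate, obj in triples:
    print(subject, predicate, obj)
```

Here the sentence yields a single triple whose predicate would have to be one of the 49 predicates in the schema; a sentence with several relations would simply carry more entries in the list.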
