DuIE Large-Scale Chinese Information Extraction Dataset
Date: 3 years ago
Size: 242.66 MB
License: Non-commercial use
DuIE is a large-scale, manually annotated dataset for evaluating schema-based knowledge extraction algorithms.
The dataset contains more than 210,000 real-world Chinese sentences and more than 450,000 SPO (Subject-Predicate-Object) triples, annotated under a pre-specified schema of 49 predicates.
All sentences in this dataset are extracted from Baidu Baike and Baidu News Search. The texts in this dataset cover various fields in real-world applications, such as news, entertainment, and user-generated content.
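To make the idea of SPO triples restricted to a fixed predicate set more concrete, here is a minimal sketch in Python. The `Triple` class, the `ALLOWED_PREDICATES` set, the helper names, and the placeholder values are all hypothetical and chosen for illustration only. They are not the dataset's actual field names, file format, or any of its 49 predicates.

```python
from dataclasses import dataclass

# Hypothetical predicate set for illustration only.
# The real dataset defines its own list of 49 predicates.
ALLOWED_PREDICATES = {"author", "director", "founder"}


@dataclass(frozen=True)
class Triple:
    """One Subject-Predicate-Object triple (illustrative structure)."""
    subject: str
    predicate: str
    obj: str


def is_valid(triple, allowed=ALLOWED_PREDICATES):
    """Return True if the predicate is in the allowed set and both arguments are non-empty."""
    return triple.predicate in allowed and bool(triple.subject) and bool(triple.obj)


def filter_valid(triples, allowed=ALLOWED_PREDICATES):
    """Keep only the triples that conform to the allowed predicate set."""
    return [t for t in triples if is_valid(t, allowed)]


# Placeholder triples, not taken from the dataset.
candidates = [
    Triple("Book A", "author", "Person B"),
    Triple("Film C", "director", "Person D"),
    Triple("Book A", "painter", "Person E"),  # predicate outside the allowed set
]
valid = filter_valid(candidates)
```

A check of this kind is one way an extraction system's output could be constrained to the predicates the annotation allows, so that predictions outside the predicate set are discarded before scoring.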
The dataset is split as follows:
- 214,590 sentences, of which:
  - 172,983 are in the training set;
  - 21,626 are in the development set;
  - 19,981 are in the test set.
- 457,866 instances, of which:
  - 363,960 are in the training set;
  - 45,558 are in the development set;
  - 48,348 are in the test set.
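The split sizes listed above can be checked against the stated totals with a few lines of Python. This is a small sketch: the dictionary and function names are illustrative assumptions, and the only data used are the counts given in the list.

```python
# Counts copied from the list above.
SENTENCES = {"train": 172_983, "dev": 21_626, "test": 19_981}
INSTANCES = {"train": 363_960, "dev": 45_558, "test": 48_348}


def summarize(counts):
    """Return the total and each split's percentage share, rounded to one decimal."""
    total = sum(counts.values())
    shares = {name: round(100 * n / total, 1) for name, n in counts.items()}
    return total, shares


sentence_total, sentence_shares = summarize(SENTENCES)
instance_total, instance_shares = summarize(INSTANCES)

# Average number of instances per sentence across the whole dataset.
instances_per_sentence = instance_total / sentence_total

print(sentence_total, sentence_shares)
print(instance_total, instance_shares)
print(instances_per_sentence)
```

The split counts sum to the stated totals of 214,590 sentences and 457,866 instances. The training set holds roughly four fifths of the sentences, and each sentence carries a little over two instances on average.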
Example data:

DuIE.torrent