HyperAI

DuIE Large-Scale Chinese Information Extraction Dataset

Date

3 years ago

Size

242.66 MB

Organization

Baidu

Publish URL

ai.baidu.com

License

Non-commercial use

DuIE is a large-scale, manually annotated dataset for evaluating schema-based knowledge extraction algorithms.

The dataset contains more than 210,000 real-world Chinese sentences annotated with more than 450,000 SPO (Subject-Predicate-Object) triples. Every triple conforms to a pre-specified schema that defines 49 predicates.
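To make the triple structure concrete, here is a minimal Python sketch of an SPO triple and a check of its predicate against a fixed predicate set. The names `SPOTriple`, `ALLOWED_PREDICATES`, and `is_valid`, as well as the two sample predicates, are illustrative assumptions. They are not DuIE's actual schema entries or file format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SPOTriple:
    """One Subject-Predicate-Object fact extracted from a sentence."""
    subject: str
    predicate: str
    obj: str


# Hypothetical stand-in for the pre-specified predicate set. The real dataset
# defines 49 predicates; these two are placeholders for illustration only.
ALLOWED_PREDICATES = frozenset({"author", "birthplace"})


def is_valid(triple: SPOTriple, allowed=ALLOWED_PREDICATES) -> bool:
    """Return True if the triple's predicate belongs to the allowed set."""
    return triple.predicate in allowed


# Made-up example, not taken from the dataset.
example = SPOTriple(
    subject="Dream of the Red Chamber",
    predicate="author",
    obj="Cao Xueqin",
)
```

Restricting predicates to a closed set is what separates this style of extraction from open information extraction, where the relation phrase can be any text span.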

All sentences in this dataset are extracted from Baidu Baike and Baidu News Search. The texts in this dataset cover various fields in real-world applications, such as news, entertainment, and user-generated content.

The dataset consists of the following data:

  • 214,590 sentences, of which:
    • 172,983 are in the training set;
    • 21,626 are in the development set;
    • 19,981 are in the test set.
  • 457,866 instances, of which:
    • 363,960 are in the training set;
    • 45,558 are in the development set;
    • 48,348 are in the test set.
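The split sizes can be sanity-checked with a few lines of Python. This sketch only uses the counts listed above; the helper name `shares` is an assumption introduced for illustration.

```python
# Counts copied from the list above.
sentences = {"train": 172_983, "dev": 21_626, "test": 19_981}
instances = {"train": 363_960, "dev": 45_558, "test": 48_348}


def shares(counts):
    """Return each split's share of the total as a percentage (one decimal)."""
    total = sum(counts.values())
    return {name: round(100 * n / total, 1) for name, n in counts.items()}


sentence_total = sum(sentences.values())  # 214,590, matching the stated total
instance_total = sum(instances.values())  # 457,866, matching the stated total

sentence_shares = shares(sentences)
instance_shares = shares(instances)

# Average number of annotated instances per sentence across the whole dataset.
instances_per_sentence = instance_total / sentence_total
```

Both sets of counts add up to the stated totals, and the sentences are divided roughly 80/10/10 between training, development, and test.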

Example data:

DuIE.torrent
  • DuIE/
    • README.md (1.53 KB)
    • README.txt (3.07 KB)
    • data/
      • all_50_schemas (6.94 KB)
      • dev_data.json (27.1 MB)
      • train_data.json (242.66 MB)
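A minimal sketch of how the data files might be read. It assumes each file stores one JSON object per line, with a `text` field and an `spo_list` whose entries carry `subject`, `predicate`, and `object` keys. This page does not confirm that layout or those field names, so check the README files in the archive before relying on them. `iter_triples` is a hypothetical helper name.

```python
import json
from typing import Iterable, Iterator, Optional, Tuple

Row = Tuple[str, Optional[str], Optional[str], Optional[str]]


def iter_triples(lines: Iterable[str]) -> Iterator[Row]:
    """Yield (text, subject, predicate, object) for every annotated triple.

    Assumes one JSON object per line; the field names are assumptions.
    """
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        text = record.get("text", "")
        for spo in record.get("spo_list", []):
            yield text, spo.get("subject"), spo.get("predicate"), spo.get("object")


# Made-up record used only to exercise the function; not taken from the dataset.
sample = [
    '{"text": "A wrote B.", '
    '"spo_list": [{"subject": "B", "predicate": "author", "object": "A"}]}'
]
rows = list(iter_triples(sample))
```

Because the function accepts any iterable of strings, an open file handle (for example one opened on `train_data.json` with `encoding="utf-8"`) can be passed in directly, which streams the file line by line instead of loading it into memory at once.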