HyperAI

FewJoint Few-shot Joint Learning Benchmark Dataset

The FewJoint benchmark dataset combines real user corpus and expert-constructed corpus from the iFlytek AIUI open platform, in a ratio of about 3:7. It covers 59 real domains, making it one of the most domain-rich conversation datasets currently available. Because it does not rely on simulated domains, it is well suited to evaluating few-shot learning and meta-learning methods.

Based on this dataset, the research team also organized the SMP 2020 few-shot conversational language understanding evaluation. Previous few-shot NLP studies mostly used artificially constructed, simple text classification tasks; here the research team introduced conversational language understanding tasks covering 59 real domains. In addition to simple text classification (intent detection), the spoken language understanding (SLU) task also covers sequence labeling (slot filling) and multi-task joint learning. These more advanced and realistic tasks let FewJoint reflect the difficulty and complexity of real-world NLP tasks better than existing simple text classification benchmarks.

The FewJoint benchmark dataset has the following main features:

  • It contains 59 real domains, making it one of the conversation datasets with the most domains. It does not rely on simulated domains and is well suited to evaluating few-shot learning and meta-learning methods.
  • It reflects the difficulty of real NLP tasks and moves beyond the current limitation of few-shot NLP, which is mostly evaluated on simple artificial tasks such as text classification.
  • It is completely open and provides an easy-to-use few-shot learning benchmark for NLP.
  • It comes with a supporting few-shot learning tool platform for NLP, MetaDialog, which makes it quick and easy to run experiments.

Dataset construction

The research team selected 59 real conversational robot APIs on the iFlytek AIUI open platform as the research domains. The user corpus comes from two main sources:

(1) Data from real users of the platform

(2) Corpus constructed by domain experts

The ratio of data from the two sources is about 3:7. After annotating each utterance with user intent and semantic slots, the research team divided the 59 domains into three parts: 45 training domains, 5 development domains, and 9 test domains. The development and test domain data were then restructured into a few-shot learning format: each domain contains a manually constructed K-shot support set and a query set made up of the remaining data.
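The split and the support/query structure described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the procedure the research team used: the function names `split_domains` and `build_episode`, the `"intent"` field name, and the random per-intent sampling are all assumptions made for illustration (the source says the support sets were constructed manually, and it does not describe the file schema).

```python
import random
from collections import defaultdict


def split_domains(domains, n_train=45, n_dev=5, n_test=9, seed=0):
    """Split a list of domain names into train/dev/test groups.

    The 45/5/9 sizes follow the description above; the random assignment
    is an assumption, since the source does not say how domains were chosen.
    """
    if len(domains) != n_train + n_dev + n_test:
        raise ValueError("number of domains must equal n_train + n_dev + n_test")
    shuffled = list(domains)
    random.Random(seed).shuffle(shuffled)
    train = shuffled[:n_train]
    dev = shuffled[n_train:n_train + n_dev]
    test = shuffled[n_train + n_dev:]
    return train, dev, test


def build_episode(samples, k, seed=0):
    """Build a simplified K-shot support set and a query set for one domain.

    Each sample is a dict with a hypothetical "intent" key. Up to k samples
    per intent go into the support set; everything else becomes the query set.
    Slot coverage is ignored here to keep the sketch short.
    """
    rng = random.Random(seed)
    ids_by_intent = defaultdict(list)
    for idx, sample in enumerate(samples):
        ids_by_intent[sample["intent"]].append(idx)

    support_ids = set()
    for ids in ids_by_intent.values():
        support_ids.update(rng.sample(ids, min(k, len(ids))))

    support = [s for i, s in enumerate(samples) if i in support_ids]
    query = [s for i, s in enumerate(samples) if i not in support_ids]
    return support, query
```

Calling `split_domains` on 59 domain names yields groups of 45, 5, and 9, and `build_episode(samples, k)` on one development or test domain returns a support set with at most `k` utterances per intent plus a query set holding the rest.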

FewJoint.torrent
  • FewJoint/
    • README.md (3.45 KB)
    • README.txt (6.9 KB)
    • data/
      • FewJoint.zip (751.82 KB)