Get 20 Popular Chinese LLM Datasets With One Click

The following article is from OpenBayes Bayesian Computing; author: Xiaobei.
Since the launch of ChatGPT, large language models (LLMs) have caused a sensation across many fields with their outstanding learning ability. Training and tuning large models cannot be separated from the support of high-quality, large-scale data. Carefully constructed datasets not only provide sufficient fuel for large models, but also make their application and performance improvement in vertical domains possible.
This article sorts out some popular public Chinese datasets suitable for large model training and tuning (arranged in alphabetical order), for everyone to understand and use.
Friendly reminder:
All the datasets listed in this article can be loaded with one click on the OpenBayes.com platform and used directly for model training and deployment.
Direct link:
https://openbayes.com/console/public/datasets
1 Ape210K Chinese primary school level mathematics problems
* Issuing Agency: Yuanfudao AI Lab, Northwestern University
* Related tags: Arithmetic tasks, text generation
* Direct use: https://hyper.ai/datasets/28445
Ape210K is a new large-scale and template-rich math word problem dataset containing 210K Chinese elementary-school-level math problems. Each problem includes the gold answer and the equation needed to arrive at it.
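Each record therefore pairs a problem statement with its solution equation and answer. Below is a minimal sketch of how such records might be iterated; the JSON Lines layout and the field names ("original_text", "equation", "ans") are assumptions based on the description above, so adjust them to the actual downloaded files.

```python
import json

# Minimal sketch for iterating over Ape210K-style records.
# Assumed layout: one JSON object per line with hypothetical fields
# "original_text" (the problem), "equation", and "ans" (the answer).
def load_problems(path):
    with open(path, encoding="utf-8") as f:
        for line in f:
            record = json.loads(line)
            yield record["original_text"], record["equation"], record["ans"]

if __name__ == "__main__":
    for question, equation, answer in load_problems("ape210k_train.jsonl"):
        print(f"Q: {question}\n  equation: {equation}\n  answer: {answer}")
        break  # show only the first record
```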
2 BELLE Dataset
* Issuing Agency: LianjiaTech (BELLE project)
* Related tags: Text generation, Chinese
* Direct use: https://hyper.ai/datasets/28451
This dataset contains approximately 3.5 million Chinese instruction-following examples generated by the BELLE project, along with an evaluation set of 1,000 samples covering 9 real-world scenarios for evaluating different models.
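As a rough illustration of how such instruction data is typically consumed, the sketch below assumes an Alpaca-style JSON Lines layout with "instruction", "input", and "output" fields (a common convention for BELLE releases, but verify against the actual files) and turns each record into a prompt/response pair.

```python
import json

# Hedged sketch: assumes Alpaca-style records with "instruction", "input",
# and "output" fields, stored one JSON object per line; adjust to the real files.
def to_prompt(example):
    prompt = example["instruction"]
    if example.get("input"):          # "input" is optional extra context
        prompt += "\n" + example["input"]
    return prompt, example["output"]

with open("belle_train.jsonl", encoding="utf-8") as f:
    for line in f:
        prompt, response = to_prompt(json.loads(line))
        print(prompt, "->", response)
        break  # show only the first pair
```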
3 Chinese SQuAD Chinese Machine Reading Comprehension Dataset
* Related tags: Extractive Q&A, intelligent Q&A
* Direct use: https://hyper.ai/datasets/28476
This is a Chinese machine reading comprehension dataset converted from the original SQuAD through machine translation and manual correction; it includes both v1.1 and v2.0 versions.
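Since the data is converted from SQuAD, it presumably keeps the nested SQuAD JSON layout (data → paragraphs → qas); the sketch below walks that structure, with the file name as a placeholder.

```python
import json

# Walk a SQuAD-format file: data -> paragraphs -> qas. The file name is a
# placeholder; v2.0-style files additionally flag unanswerable questions
# with an "is_impossible" field.
with open("chinese_squad_v1.1.json", encoding="utf-8") as f:
    squad = json.load(f)

for article in squad["data"]:
    for paragraph in article["paragraphs"]:
        context = paragraph["context"]
        for qa in paragraph["qas"]:
            answers = [a["text"] for a in qa["answers"]]
            print(qa["question"], "->", answers[:1])
```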
4 CMRC 2018 Chinese Machine Reading Comprehension Evaluation Dataset
* Issuing Agency: iFlytek, CCL, HFL
* Related tags: Text generation
* Direct use: https://hyper.ai/datasets/28470
This dataset contains the data used in the 2nd "iFLYTEK Cup" Chinese Machine Reading Comprehension Evaluation (CMRC 2018); the accompanying dataset paper was accepted by EMNLP 2019, a top international conference on natural language processing.
5 CrossWOZ Task-oriented dialogue dataset
* Issuing Agency: Tsinghua University, BNRIST
* Related tags: Question answering dataset, Chinese
* Direct use: https://hyper.ai/datasets/28442
CrossWOZ is the first large-scale Chinese cross-domain task-oriented Wizard-of-Oz dataset. It contains 6K dialogues and 102K utterances spanning 5 domains (attractions, hotels, restaurants, metro, and taxi). In addition, the corpus contains rich annotations of dialogue states and dialogue acts for both the user and system sides.
6 DRCD Delta reading comprehension dataset
* Issuing Agency: Delta Research Center, Delta Electronics
* Related tags: Text detection, machine learning
* Direct use: https://hyper.ai/datasets/28473
The Delta Reading Comprehension Dataset (DRCD) is a general-purpose Traditional Chinese machine reading comprehension dataset that aims to serve as a standard Chinese MRC benchmark. It contains 10,014 paragraphs from 2,108 Wikipedia articles and more than 30,000 questions written by human annotators.
7 Douban Conversation Corpus
* Issuing Agency: Beihang University, Nankai University, MSR
* Related tags: Question and answer analysis, natural language processing
* Direct use: https://hyper.ai/datasets/28497
This dataset includes a training set, a development set, and a test set for retrieval-based chatbots. The test data contains 1,000 dialogue contexts, and for each context 10 candidate responses are provided.
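With 10 candidates per context, retrieval models on this kind of test set are usually scored with recall-at-k over each group of 10 (often written R10@k). The sketch below shows that metric on synthetic scores; it is illustrative only and not tied to the corpus's exact file format.

```python
# Recall-at-k for a retrieval-based chatbot evaluated on groups of
# 1 context + 10 candidate responses: a group counts as a hit if the
# correct response appears among the top-k ranked candidates.
def recall_at_k(scores, labels, k, group_size=10):
    hits = groups = 0
    for start in range(0, len(scores), group_size):
        group = list(zip(scores[start:start + group_size],
                         labels[start:start + group_size]))
        ranked = sorted(group, key=lambda pair: pair[0], reverse=True)
        hits += any(label == 1 for _, label in ranked[:k])
        groups += 1
    return hits / groups

# Synthetic example: two contexts, 10 candidates each (label 1 = correct).
scores = [0.9, 0.1, 0.2, 0.3, 0.1, 0.0, 0.4, 0.2, 0.1, 0.3,
          0.2, 0.8, 0.9, 0.1, 0.3, 0.2, 0.1, 0.0, 0.4, 0.5]
labels = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0,
          0, 1, 0, 0, 0, 0, 0, 0, 0, 0]
print(recall_at_k(scores, labels, k=1))  # 0.5: only the first context hits at k=1
```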
8 DuReader Question Answering Dataset
* Issuing Agency: Baidu
* Related tags: Question answering dataset, intelligent question answering
* Direct use: https://hyper.ai/datasets/28461
DuReader is a benchmark dataset (released together with baseline models) for machine reading comprehension, mainly used for intelligent question answering tasks.
9 E-KAR (Chinese version) A benchmark for explainable knowledge-intensive analogical reasoning
* Issuing Agency: Fudan University, ByteDance AI Lab, Brain Technologies, Inc.
* Related tags: Text generation, natural language processing
* Direct use: https://hyper.ai/datasets/28517
E-KAR stands for Explainable Knowledge-intensive Analogical Reasoning benchmark. Existing word analogy benchmarks cannot reveal the underlying reasoning process of neural models. The researchers argue that a model with reasoning ability should use correct reasons as its underlying justification, and therefore proposed E-KAR, the first explainable knowledge-intensive analogical reasoning benchmark. The dataset consists of 1,655 Chinese and 1,251 English questions drawn from civil service examinations, which require extensive background knowledge to solve.
10 FCGEC Chinese grammar error detection and correction dataset
* Issuing Agency: Zhejiang University, Huawei
* Related tags: Text detection
* Direct use: https://hyper.ai/datasets/28512
FCGEC stands for Fine-Grained Corpus for Chinese Grammatical Error Correction. It is a large-scale, multi-reference error correction corpus of native-speaker text, used to train and evaluate grammatical error correction systems. The data mainly comes from erroneous sentences in exam questions for primary, middle, and high school students, as well as from news aggregation websites.
11 KdConv Chinese Multi-domain Conversational Dataset
* Issuing Agency: Tsinghua University
* Related tags: Text generation
* Direct use: https://hyper.ai/datasets/28507
KdConv is a Chinese multi-domain knowledge-driven conversation dataset that grounds the topics of multi-turn conversations in a knowledge graph. It contains 4.5K conversations and 86K utterances from three domains (film, music, and travel), with an average of 19.0 turns per conversation. It is suitable for modeling knowledge interaction in multi-turn human dialogues, including knowledge planning, knowledge grounding, and knowledge adaptation.
12 Math23K Math Words Dataset
* Issuing Agency: Tencent AI Lab
* Related tags: Corpus, math problems
* Direct use: https://hyper.ai/datasets/28504
Math23K is a dataset created for math word problem solving. It contains 23,162 Chinese problems crawled from the Internet.
13 MedDialog Chinese doctor-patient dialogue dataset
* Related tags: Medical research, conversational datasets
* Direct use: https://hyper.ai/datasets/28483
MedDialog is a large-scale Chinese medical conversation dataset containing 1.1 million doctor-patient conversations and 4 million utterances.
14 ODSQA Open Domain Spoken Question Answering Dataset
* Issuing Agency: National Taiwan University
* Related tags: Intelligent question answering, natural language processing
* Direct use: https://hyper.ai/datasets/28500
ODSQA is a Chinese spoken question answering dataset containing over three thousand questions from 20 different speakers.
15 RedGPT Automatically generated factual dialogue dataset
* Related tags: Text generation, natural language processing
* Direct use: https://hyper.ai/datasets/28448
RedGPT stands for Reference-Enlightened-Dialogue by GPT and for GPT. Factual accuracy is one of ChatGPT's major weaknesses, and it can be improved by annotating large amounts of factual dialogue data for fine-tuning GPT models. To avoid the expensive cost of manual annotation, the researchers proposed a method to automatically generate factual dialogues and released part of the data (RedGPT-Dataset-V1-CN), which contains 50,000 Chinese multi-turn dialogues.
16 The United Nations Parallel Corpus v1.0
* Issuing Agency: United Nations
* Related tags: Machine translation, parallel corpus
* Direct use: https://hyper.ai/datasets/28464
The United Nations Parallel Corpus v1.0 is composed of official records and other parliamentary documents of the United Nations in its six official languages (Arabic, Chinese, English, French, Russian, and Spanish), making it a widely used resource for machine translation, including Chinese-English translation.
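As a hedged sketch of how a line-aligned plain-text release of a parallel corpus can be consumed (file names below are placeholders; the actual distribution also ships other formats such as TMX), one can simply zip the per-language files:

```python
from itertools import islice

# Read sentence pairs from two line-aligned files (one language per file).
# File names are placeholders for a Chinese-English portion of the corpus.
def read_parallel(zh_path, en_path, limit=3):
    with open(zh_path, encoding="utf-8") as zh, open(en_path, encoding="utf-8") as en:
        for zh_line, en_line in islice(zip(zh, en), limit):
            yield zh_line.strip(), en_line.strip()

for zh_sent, en_sent in read_parallel("UNv1.0.en-zh.zh", "UNv1.0.en-zh.en"):
    print(zh_sent, "|||", en_sent)
```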
17 VQA Visual Question Answering Dataset
* Related tags: Visual question answering, question answering dataset
* Direct use: https://hyper.ai/datasets/28455
The development of deep learning has driven progress on multimodal learning tasks. Visual Question Answering (VQA) is a particularly challenging example: it requires high-level scene interpretation of images combined with language modeling of the related questions and answers. Given an image and a natural-language question about the image, the task is to provide an accurate natural-language answer. The linked resource is an end-to-end system implemented in Keras that aims to accomplish this task.
18 WebQA v1.0 Baidu Chinese Question Answering Dataset
* Issuing Agency: Baidu
* Related tags: Deep learning, intelligent question answering
* Direct use: https://hyper.ai/datasets/28467
This dataset was open-sourced by Baidu in 2016. The data comes from Baidu Zhidao (Baidu Knows); each entry pairs a question with multiple passages of essentially the same meaning, and the data is split into a manually annotated portion and a browser-retrieved portion.
19 XiaChuFang Recipe Corpus
* Related tags: Text recognition, text detection
* Direct use: https://1lh.cc/4jaL8b
* Direct use: https://hyper.ai/datasets/28489
This recipe corpus contains 1,520,327 Chinese recipes, of which 1,242,206 belong to 30,060 dishes, for an average of 41.3 recipes per dish. The recipes were contributed by 415,272 authors; the most prolific author uploaded 5,394 recipes.
20 XQuAD Cross-lingual Question Answering Dataset
* Issuing Agency: DeepMind
* Related tags: Question and answer analysis, reading comprehension
* Direct use: https://hyper.ai/datasets/28458
XQuAD (Cross-lingual Question Answering Dataset) is a benchmark dataset for evaluating cross-lingual question answering performance. It consists of a subset of 240 passages and 1,190 question-answer pairs from the development set of SQuAD v1.1 (Rajpurkar et al., 2016), together with their professional translations into other languages, including Chinese.
One-click loading of the above datasets
A rich collection of datasets still needs the support of a high-quality computing platform. The OpenBayes Bayesian computing platform now supports one-click dataset binding: with a single click during container creation, the target dataset can be bound to the corresponding container, eliminating the tedious downloading and uploading process without occupying the user's personal storage space.
Video tutorial reference:
[OpenBayes Official Tutorial] Organizational Collaboration (bilibili)
For detailed documentation, see: https://1lh.cc/v2ao4q
In addition, the OpenBayes platform also provides more than 500 curated public datasets, models, tutorials, and other high-quality resources, all integrated into the "Public Resources" module.
To experience fast dataset binding now, please visit