HyperAI

Summary of 10 Major Chinese Medical Datasets: Covering Shennong Chinese Medicine, Ancient Chinese Medicine Books, Medical Reasoning, Medical Questions and Answers...


The rapid development of medical artificial intelligence depends on high-quality datasets. From disease diagnosis to drug development to personalized medicine, datasets play an indispensable role in applying machine vision, large models, and related techniques in the medical field.

Medical datasets come in various forms, covering data resources of different dimensions and fields. For example, in the field of disease diagnosis, question-answering datasets such as RJUA-QA promote the automated application of complex medical knowledge; in the field of traditional Chinese medicine, the Shennong Chinese Medicine dataset integrates traditional Chinese medicine literature, clinical cases, and prescription data.

This article compiles 10 datasets in the medical field, covering Shennong Traditional Chinese Medicine, ancient Chinese medicine books, medical reasoning, medical Q&A, and more. The aim is to help researchers quickly understand the distribution and characteristics of these data resources and to provide inspiration for applying them to specific research problems.

Click to view more open source datasets:

https://go.hyper.ai/SjWDr


Summary of Chinese Medical Datasets

1. The first Chinese medical specialty question-answering reasoning dataset

Estimated size: 2.34 MB

Download address: https://go.hyper.ai/rIwcK

This dataset is a question-answering and reasoning dataset for the urology specialty, created jointly by the Ant Group Medical LLM (Large Language Model) team and the urology expert team of Renji Hospital, affiliated with Shanghai Jiao Tong University School of Medicine. It is presented in the Q-context-A (question-context-answer) format. The case data were written by professional doctors based on clinical experience and do not involve any personal private information of patients or doctors.

2. Chinese Medical Question Answering Dataset

Estimated size: 279.64 MB

Download address: https://go.hyper.ai/lM5sd

This is a Chinese medical question-answering dataset organized into 6 folders, one per medical department: Andrology (94,596 Q&A pairs), Internal Medicine (220,606 Q&A pairs), Obstetrics and Gynecology (183,751 Q&A pairs), Oncology (75,553 Q&A pairs), Pediatrics (101,602 Q&A pairs), and Surgery (115,991 Q&A pairs), for a total of 792,099 Q&A pairs. Each folder contains one CSV file.
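As a rough illustration of how the per-department figures could be checked after download, the sketch below sums the counts quoted above and counts data rows in each folder's CSV file. The function names, the assumption that each CSV starts with a header row, and the default `utf-8` encoding are all hypothetical; the actual files may use a different layout or encoding (pass another `encoding` value if so).

```python
import csv
from pathlib import Path

# Per-department counts exactly as quoted in the article text.
REPORTED_COUNTS = {
    "Andrology": 94_596,
    "Internal Medicine": 220_606,
    "Obstetrics and Gynecology": 183_751,
    "Oncology": 75_553,
    "Pediatrics": 101_602,
    "Surgery": 115_991,
}


def count_rows(csv_path, encoding="utf-8"):
    """Count data rows in one CSV file.

    Assumption: the first row is a header and is not counted.
    """
    with open(csv_path, newline="", encoding=encoding) as f:
        reader = csv.reader(f)
        next(reader, None)  # skip the assumed header row
        return sum(1 for _ in reader)


def count_by_folder(root, encoding="utf-8"):
    """Return {folder name: total data rows across its CSV files}.

    Assumption: `root` holds one subfolder per department, as the
    article describes; folder names are taken from disk as-is.
    """
    counts = {}
    for folder in sorted(p for p in Path(root).iterdir() if p.is_dir()):
        counts[folder.name] = sum(
            count_rows(csv_file, encoding)
            for csv_file in sorted(folder.glob("*.csv"))
        )
    return counts


if __name__ == "__main__":
    # The quoted per-department figures add up to the stated total.
    print(sum(REPORTED_COUNTS.values()))  # 792099
```

Calling `count_by_folder("path/to/dataset")` on the extracted download (the path here is a placeholder) and comparing the result with the quoted figures is one way to confirm that the download is complete.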

3. Medical dialogue dataset

Estimated size: 118.35 MB

Download address: https://go.hyper.ai/MCH57

This is an experimental dataset designed for running medical chatbots, which contains 256,916 conversations between patients and doctors.

4. Shennong Traditional Chinese Medicine Dataset

Estimated size: 28.98 MB

Download address: https://go.hyper.ai/iJsGu

This dataset is designed for training and evaluating large language models in the field of traditional Chinese medicine. It contains more than 110,000 instruction examples generated with an entity-centric self-instruct method, focusing on the core entities and different intent scenarios in the field. It is intended to improve a model's ability to answer questions about traditional Chinese medicine, to assist in traditional Chinese medicine diagnosis, and to support personalized medical advice.

5. Traditional Chinese Medicine Ancient Books Dataset

Estimated size: 80.49 MB

Download address: https://go.hyper.ai/pyHEs

This dataset contains about 700 ancient Chinese medicine texts, covering medical classics from the pre-Qin period to the late Qing Dynasty and the Republic of China. These documents not only include medical theories, prescriptions, pharmacology, etc., but also contain rich clinical cases and medical encyclopedia knowledge.

6. Traditional Chinese Medicine Diagnosis Dataset

Estimated size: 341.69 MB

Download address: https://go.hyper.ai/cIHaP

This dataset focuses on traditional Chinese medicine and contains about 1 GB of high-quality content, such as clinical cases from various fields of traditional Chinese medicine, well-known books, medical encyclopedias, and glossaries. It is composed mainly of internal data from non-network sources, and 99% of it is in Simplified Chinese. With its high quality and information density, it is suitable for pre-training or continued pre-training.

7. Traditional Chinese Medicine Dialogue Dataset

Estimated size: 737.32 MB

Download address: https://go.hyper.ai/cCrcT

This Chinese medical dataset is a comprehensive resource for developing and training language models that can provide professional conversations and advice in the medical field. It combines multiple types of data, including encyclopedia knowledge, textbook texts, actual doctor-patient conversations, and evaluation data, aiming to improve the accuracy and practicality of the model.

8. Medical Reasoning Dataset

Download address: https://go.hyper.ai/BAVNR

This dataset was released in 2024 by The Chinese University of Hong Kong, Shenzhen and the Shenzhen Research Institute of Big Data. It is designed specifically for fine-tuning the HuatuoGPT-o1 medical large language model to improve its performance on complex medical reasoning tasks.

9. Multilingual Medical Proficiency Test Benchmark Dataset

Estimated size: 20.69 MB

Download address: https://go.hyper.ai/ux6FF

This dataset is a comprehensive multilingual medical proficiency test benchmark dataset developed by the Smart Healthcare Team of the School of Artificial Intelligence of Shanghai Jiao Tong University in 2024. It aims to evaluate the development of multilingual models in the medical field and covers 6 languages and 21 medical sub-fields.

10. MMedC Large-Scale Multilingual Medical Corpus

Estimated size: 31.05 GB

Download address: https://go.hyper.ai/K8RcQ

This dataset is a multilingual medical corpus built by the Smart Healthcare Team of the School of Artificial Intelligence of Shanghai Jiao Tong University in 2024. It contains approximately 25.5 billion tokens covering 6 major languages: English, Chinese, Japanese, French, Russian and Spanish.

These are the Chinese medical datasets compiled by HyperAI. If you have resources that you would like to see included on the hyper.ai official website, you are welcome to leave a message or submit a contribution.

About HyperAI

HyperAI (hyper.ai) is the leading artificial intelligence and high-performance computing community in China. We are committed to becoming the infrastructure in the field of data science in China and providing rich and high-quality public resources for domestic developers. So far, we have:

* Provided domestic accelerated download nodes for 1300+ public datasets

* Included 400+ classic and popular online tutorials

* Interpreted 200+ AI4Science paper cases

* Supported search for 500+ related terms

* Hosted the first complete Apache TVM Chinese documentation in China

Visit the official website to start your learning journey:

https://hyper.ai