Summary of 10 Major Chinese Medical Datasets: Covering Shennong Chinese Medicine, Ancient Chinese Medicine Books, Medical Reasoning, Medical Questions and answers...

a year ago

The rapid development of medical artificial intelligence depends on high-quality datasets. From disease diagnosis to drug development to personalized medicine, datasets are indispensable for applying machine vision, large models, and related techniques in the medical field.

Medical datasets come in many forms and cover data resources from different dimensions and fields. In disease diagnosis, for example, question-answering datasets such as RJUA-QA support the automated application of complex medical knowledge; in traditional Chinese medicine, the ShenNong TCM dataset integrates traditional Chinese medicine literature, clinical cases, and prescription data.

This article therefore compiles 10 datasets in the medical field, covering ShenNong traditional Chinese medicine, ancient Chinese medicine books, medical reasoning, medical Q&A, and more. The aim is to help researchers quickly understand the distribution and characteristics of these data resources and to offer ideas for applying them to specific research problems.

Click to view more open source datasets:

https://go.hyper.ai/SjWDr


Summary of Chinese Medical Datasets

1. MedChatZH Chinese medical conversation instruction dataset

Estimated size: 3.9 GB

Download address: https://go.hyper.ai/AZwFf

MedChatZH is a Chinese medical conversation dataset released by East China University of Science and Technology. It is intended to improve the understanding and generation of Chinese medical consultation dialogues, especially in TCM scenarios, through continued pre-training on TCM classics followed by fine-tuning on medical instruction data.

2. RJUA-QA The first Chinese medical specialty question answering reasoning dataset

Estimated size: 2.34 MB

Download address: https://go.hyper.ai/rIwcK

This dataset is a specialist question-answering and reasoning dataset for urology, created by the Ant Group Medical LLM (Large Language Model) team and the urology expert team of Renji Hospital, affiliated with Shanghai Jiao Tong University School of Medicine. It is presented in the Q-context-A (question-context-answer) format. The cases are written by professional doctors based on clinical experience and do not involve any personal information of patients or doctors.
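The question-context-answer layout can be pictured with a small sketch. The field names `question`, `context`, and `answer`, the prompt wording, and the placeholder strings are all assumptions made for illustration; the actual field names and file format of RJUA-QA are not stated here and should be checked against the download.

```python
from dataclasses import dataclass


@dataclass
class QContextA:
    """One question-context-answer record (field names are hypothetical)."""

    question: str
    context: str
    answer: str


def build_prompt(record: QContextA) -> str:
    """Render context and question as a prompt; the answer is held out as the target."""
    return (
        f"Context:\n{record.context}\n\n"
        f"Question:\n{record.question}\n\n"
        "Answer:"
    )


# Placeholder values only; these are not taken from the dataset.
example = QContextA(
    question="placeholder question",
    context="placeholder context",
    answer="placeholder answer",
)
prompt = build_prompt(example)
```

Keeping the answer out of the prompt lets the same record serve either as a fine-tuning target or as a reference for scoring a model's output.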

3. Chinese Medical Dialogue Data

Estimated size: 279.64 MB

Download address: https://go.hyper.ai/lM5sd

This dataset is a Chinese medical question-answering dataset organized into 6 folders, one per medical department: Andrology (94,596 Q&A pairs), Internal Medicine (220,606 Q&A pairs), Obstetrics and Gynecology (183,751 Q&A pairs), Oncology (75,553 Q&A pairs), Pediatrics (101,602 Q&A pairs), and Surgery (115,991 Q&A pairs), for a total of 792,099 Q&A pairs. Each folder contains one CSV file.
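As a rough way to verify these counts after downloading, the sketch below counts data rows per department folder and cross-checks the quoted total. It assumes that each CSV has a header row and that the files decode with the `encoding` argument passed in; neither is confirmed by the article, so adjust both to match the actual files. The function name `count_rows_per_department` is illustrative.

```python
import csv
from pathlib import Path


def count_rows_per_department(root: Path, encoding: str = "utf-8") -> dict[str, int]:
    """Count data rows (header excluded) in the CSV files of each department folder."""
    counts: dict[str, int] = {}
    for folder in sorted(p for p in root.iterdir() if p.is_dir()):
        total = 0
        for csv_path in sorted(folder.glob("*.csv")):
            with csv_path.open(newline="", encoding=encoding) as handle:
                reader = csv.reader(handle)
                next(reader, None)  # skip the header row (assumed to exist)
                total += sum(1 for _ in reader)
        counts[folder.name] = total
    return counts


# Cross-check of the per-department figures quoted in the article.
reported = {
    "Andrology": 94_596,
    "Internal Medicine": 220_606,
    "Obstetrics and Gynecology": 183_751,
    "Oncology": 75_553,
    "Pediatrics": 101_602,
    "Surgery": 115_991,
}
reported_total = sum(reported.values())
```

The six quoted figures sum to 792,099, which matches the stated total. Using `csv.reader` rather than counting raw lines avoids over-counting answers that contain line breaks inside quoted fields.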

4. AI Medical Chatbot Medical Conversation Dataset

Estimated size: 118.35 MB

Download address: https://go.hyper.ai/MCH57

This is an experimental dataset designed for running medical chatbots, which contains 256,916 conversations between patients and doctors.

5. ShenNong TCM Dataset Shennong Traditional Chinese Medicine Dataset

Estimated size: 28.98 MB

Download address: https://go.hyper.ai/iJsGu

This dataset is designed for training and evaluating large language models in the field of traditional Chinese medicine. It contains more than 110,000 instruction samples generated with an entity-centric self-instruct method, which focuses on the core entities of traditional Chinese medicine and the different intents a user may have about them. The data can improve a model's ability to answer questions about traditional Chinese medicine, and can also help it assist in TCM diagnosis and provide personalized medical advice.
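The entity-centric idea can be illustrated with a minimal sketch: every core entity is paired with every intent to produce a seed prompt, which a language model would then expand into an instruction sample. The entity list, the intent list, and the prompt wording below are hypothetical and are not taken from the dataset or its generation pipeline; the model call itself is left out.

```python
from itertools import product

# Hypothetical examples, chosen only to illustrate the pairing step.
ENTITIES = ["ginseng", "licorice root"]
INTENTS = ["indications", "contraindications", "typical usage"]


def build_seed_prompts(entities: list[str], intents: list[str]) -> list[dict[str, str]]:
    """Pair every entity with every intent and render one seed prompt per pair."""
    return [
        {
            "entity": entity,
            "intent": intent,
            "prompt": (
                f"Write a question a user might ask about the {intent} of "
                f"{entity} in traditional Chinese medicine, then answer it."
            ),
        }
        for entity, intent in product(entities, intents)
    ]


seeds = build_seed_prompts(ENTITIES, INTENTS)
```

With 2 entities and 3 intents this yields 6 seed prompts; coverage grows with the product of the two lists, which is one way such a method can reach a six-figure sample count from a much smaller entity vocabulary.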

6. TCM Ancient Books Traditional Chinese Medicine Ancient Books Dataset

Estimated size: 80.49 MB

Download address: https://go.hyper.ai/pyHEs

This dataset contains about 700 ancient Chinese medicine texts, covering medical classics from the pre-Qin period to the late Qing Dynasty and the Republic of China. These documents not only include medical theories, prescriptions, pharmacology, etc., but also contain rich clinical cases and medical encyclopedia knowledge.

7. Traditional Chinese Medicine Dataset SFT Traditional Chinese Medicine Diagnosis Dataset

Estimated size: 341.69 MB

Download address: https://go.hyper.ai/cIHaP

This dataset focuses on traditional Chinese medicine and contains about 1GB of content, including clinical cases from various TCM fields, famous books, medical encyclopedias, and glossaries. It is composed mainly of internal data from non-network sources, and 99% of it is in simplified Chinese. With its high quality and information density, it is suitable for pre-training or continued pre-training.

8. Chinese Medical Dialogue Dataset

Estimated size: 737.32 MB

Download address: https://go.hyper.ai/cCrcT

This Chinese medical dataset is a comprehensive resource for developing and training language models that can provide professional conversations and advice in the medical field. It combines multiple types of data, including encyclopedia knowledge, textbook texts, actual doctor-patient conversations, and evaluation data, aiming to improve the accuracy and practicality of the model.

9. Medical o1 Reasoning SFT Medical Reasoning Dataset

Download address: https://go.hyper.ai/BAVNR

This dataset was released by the Chinese University of Hong Kong and Shenzhen Institute of Big Data in 2024. It is designed specifically for fine-tuning the HuatuoGPT-o1 medical large language model to improve its performance in complex medical reasoning tasks.

10. MMedBench Multilingual Medical Proficiency Test Benchmark Dataset

Estimated size: 20.69 MB

Download address: https://go.hyper.ai/ux6FF

This dataset is a comprehensive multilingual medical proficiency test benchmark dataset developed by the Smart Healthcare Team of the School of Artificial Intelligence of Shanghai Jiao Tong University in 2024. It aims to evaluate the development of multilingual models in the medical field and covers 6 languages and 21 medical sub-fields.
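For a benchmark that spans several languages, results are usually easier to read when broken down per language. The sketch below assumes each evaluated item reduces to a `(language, predicted, gold)` triple with a single gold answer label; that shape, the function name, and the toy results are illustrative assumptions rather than MMedBench's actual format or official scoring script.

```python
from collections import defaultdict


def accuracy_by_language(results: list[tuple[str, str, str]]) -> dict[str, float]:
    """Compute exact-match accuracy separately for each language."""
    correct: dict[str, int] = defaultdict(int)
    total: dict[str, int] = defaultdict(int)
    for language, predicted, gold in results:
        total[language] += 1
        correct[language] += int(predicted == gold)
    return {language: correct[language] / total[language] for language in total}


# Toy results, invented for illustration only.
toy_results = [
    ("Chinese", "A", "A"),
    ("Chinese", "B", "C"),
    ("English", "D", "D"),
]
scores = accuracy_by_language(toy_results)
# scores == {"Chinese": 0.5, "English": 1.0}
```

Reporting per-language scores alongside an overall average makes it visible when a model is strong in one language and weak in another.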

11. MMedC Large-Scale Multilingual Medical Corpus

Estimated size: 31.05 GB

Download address: https://go.hyper.ai/K8RcQ

This dataset is a multilingual medical corpus built by the Smart Healthcare Team of the School of Artificial Intelligence of Shanghai Jiao Tong University in 2024. It contains approximately 25.5 billion tokens covering 6 major languages: English, Chinese, Japanese, French, Russian and Spanish.

These are the Chinese medical datasets compiled by HyperAI. If you have resources that you would like to see included on the hyper.ai official website, you are welcome to leave a message or submit a contribution.

About HyperAI

HyperAI (hyper.ai) is the leading artificial intelligence and high-performance computing community in China. We are committed to becoming the infrastructure for data science in China and to providing rich, high-quality public resources for domestic developers. So far, we have:

* Provided domestic accelerated download nodes for 1300+ public datasets

* Included 400+ classic and popular online tutorials

* Interpreted 200+ AI4Science paper cases

* Supported search for 500+ related terms

* Hosted the first complete Apache TVM Chinese documentation in China

Visit the official website to start your learning journey:

https://hyper.ai
