HyperAI

Summary of 10 Major Medical Datasets: Covering Question Answering/Reasoning/Real Clinical Records/Ultrasound Images/CT Images...

With the deep integration of artificial intelligence technology in the medical field and the continuous innovation of medical imaging technology, medical data, as the key to unlocking the mysteries of life, is accumulating and growing at an explosive rate. It has broken through the boundaries of traditional medical research and brought revolutionary changes to disease diagnosis and treatment and health management.

As medical research moves from experience-driven to data-driven, the iteration speed of basic research tools has gradually slowed down. The quality of medical datasets has become a core factor in determining whether a model can move from theoretical conception to practical clinical application. High-quality medical data can not only accurately capture disease characteristics, but also provide reliable support for the formulation of personalized medical plans.

The construction of a medical dataset is by no means a simple listing of cases. Compared with general data collection, the acquisition of medical data must strictly follow ethical standards to protect patient privacy and ensure compliant data use. To keep the data scientifically valid and effective, it is necessary to standardize the data collection process, allocate training, validation and test sets sensibly, and establish a dynamic update mechanism that regularly adds new data to keep pace with changes in the disease spectrum and advances in diagnosis and treatment. For complex medical tasks such as disease diagnosis, drug development, and health prediction, dataset construction also requires analyzing the needs of each field in depth, integrating multimodal information, simulating real clinical scenarios, and providing practical learning samples for model training.
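One common way to allocate training, validation and test sets for medical data is to split by patient rather than by record, so that data from one patient never appears in two splits. The sketch below illustrates that idea only; the `patient_id` key, the split fractions, and the toy records are assumptions for illustration and do not come from any dataset listed here.

```python
import random
from collections import defaultdict


def split_by_patient(records, val_frac=0.1, test_frac=0.1, seed=0):
    """Split records into train/val/test so that no patient spans two splits.

    `records` is a list of dicts with a hypothetical "patient_id" key.
    """
    by_patient = defaultdict(list)
    for rec in records:
        by_patient[rec["patient_id"]].append(rec)

    # Shuffle patient IDs (not individual records) with a fixed seed.
    patient_ids = sorted(by_patient)
    random.Random(seed).shuffle(patient_ids)

    n_test = int(len(patient_ids) * test_frac)
    n_val = int(len(patient_ids) * val_frac)
    groups = {
        "test": patient_ids[:n_test],
        "val": patient_ids[n_test:n_test + n_val],
        "train": patient_ids[n_test + n_val:],
    }
    # Expand each group of patient IDs back into its records.
    return {
        name: [rec for pid in ids for rec in by_patient[pid]]
        for name, ids in groups.items()
    }


# Toy example: 20 synthetic patients with 3 records each.
records = [{"patient_id": f"p{i:02d}", "scan": j} for i in range(20) for j in range(3)]
splits = split_by_patient(records)
```

Splitting at the patient level tends to give a more honest estimate of how a model generalizes to unseen patients than a purely random record-level split.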

In short, in the era of precision medicine, demand for high-quality medical datasets across the medical community has grown explosively. HyperAI has compiled a series of valuable and widely used medical datasets, covering multiple medical fields such as cancer, the heart, and bone X-rays. Some of them come from top medical schools and authoritative medical institutions.

Click to view more open source datasets:

https://go.hyper.ai/g9PvL

Medical Dataset Summary

1 JMED Chinese Real-World Medical Dataset

Download address: https://go.hyper.ai/4jJTa

The JMED dataset is a new dataset based on the distribution of real-world medical data, built by the Citrus Team in 2025. It is derived from anonymized doctor-patient conversations at JD Health Internet Hospital, filtered to retain consultations that follow a standardized diagnostic workflow. The initial version contains 1,000 high-quality clinical records covering all age groups (0-90 years old) and multiple specialties. Each question includes 21 answer options.

Unlike existing datasets, JMED closely simulates real clinical data while still supporting effective model training. Although it is based on real consultation data, it is not taken directly from actual medical data, which lets the research team integrate the key elements required for model training.
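Because each JMED question offers 21 answer options, uniform random guessing would be right only about 4.8% of the time (1/21), which is worth keeping in mind when reading accuracy numbers. The sketch below shows a plain accuracy computation next to that baseline. It assumes exactly one correct option per question, and the option labels are made-up placeholders, not real JMED data.

```python
def accuracy(predictions, gold):
    """Fraction of questions where the predicted option matches the gold option."""
    if len(predictions) != len(gold):
        raise ValueError("predictions and gold must have the same length")
    correct = sum(p == g for p, g in zip(predictions, gold))
    return correct / len(gold)


NUM_OPTIONS = 21  # number of answer options per question, as described above
chance_level = 1 / NUM_OPTIONS  # expected accuracy of uniform random guessing

# Hypothetical option labels for four questions (placeholders, not JMED data).
gold = ["C", "A", "U", "F"]
predictions = ["C", "B", "U", "F"]
score = accuracy(predictions, gold)
print(f"accuracy={score:.2f}, chance={chance_level:.3f}")
```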

2 MedQA Medical Text Question Answering Dataset

Estimated size: 125.64 MB

Download address: https://go.hyper.ai/VfIWx

The MedQA dataset is a question-answering dataset for the medical field that simulates the style of the United States Medical Licensing Examination (USMLE). It was released in 2020 by a research team from MIT and Huazhong University of Science and Technology. The related paper is "What Disease does this Patient Have? A Large-scale Open Domain Question Answering Dataset from Medical Exams".

The dataset covers three languages, English, Simplified Chinese and Traditional Chinese, with 12,723, 34,251 and 14,123 questions respectively, and is designed to evaluate a model's ability to understand and apply medical knowledge. It is divided into training, development and test sets, which are used for model training, validation and testing respectively.

3 Medical o1 Reasoning SFT Medical Reasoning Dataset

Estimated size: 21.71 MB

Download address: https://go.hyper.ai/iVUWA

The Medical o1 Reasoning SFT dataset was released by the Chinese University of Hong Kong and Shenzhen Institute of Big Data in 2024. The related paper is "HuatuoGPT-o1, Towards Medical Complex Reasoning with LLMs".

This dataset is designed for fine-tuning the HuatuoGPT-o1 medical language model to improve its performance on complex medical reasoning tasks. It was constructed with GPT-4o, which searches for solutions to verifiable medical problems; a medical verifier then checks the answers to improve the accuracy and reliability of the data.

4 ROCOv2 Radiology Multimodal Medical Image Dataset

Estimated size: 17.29 GB

Download address: https://go.hyper.ai/xs4zS

ROCOv2 (Radiology Objects in COntext Version 2) is a multimodal medical image dataset that pairs radiology images with related medical concepts and descriptions. The images, concepts and descriptions are extracted from the PMC Open Access subset, with improved concept extraction and filtering compared with the original ROCO dataset.

The dataset contains 79,789 radiology images covering a variety of clinical modalities, anatomical regions, and orientations (for X-rays), each with a corresponding medical concept description. It can be used for training image captioning models, multi-label image classification, pre-training of medical domain models, deep learning model evaluation, and image retrieval.

5 MedCalc-Bench medical computing dataset

Estimated size: 16.04 MB

Download address: https://go.hyper.ai/pDbcu

MedCalc-Bench is a dataset specifically designed to evaluate the medical calculation capabilities of large language models (LLMs). It was jointly released in 2024 by nine institutions including the National Library of Medicine, the National Institutes of Health and the University of Virginia. The related paper is "MEDCALC-BENCH: Evaluating Large Language Models for Medical Calculations", which has been accepted by NeurIPS 2024.

The dataset contains 10,055 training instances and 1,047 test instances, covering 55 different calculation tasks. Each instance includes a patient note, a question asking for a specific clinical value, the final answer value, and a step-by-step solution. The training set can be used to fine-tune LLMs to improve their performance on medical calculation tasks.
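To make the four-part instance structure more concrete, here is a minimal sketch of how such an instance and a numeric answer check might look. The field names, the 5% relative tolerance, and the toy BMI example are all illustrative assumptions; they are not the dataset's actual schema, scoring rule, or content.

```python
from dataclasses import dataclass


@dataclass
class CalcInstance:
    # Hypothetical field names mirroring the four parts described above.
    patient_note: str
    question: str
    final_answer: float
    explanation: str


def is_correct(predicted, expected, rel_tol=0.05):
    """Accept a numeric prediction within a relative tolerance of the reference.

    The 5% default is an assumption for illustration only.
    """
    return abs(predicted - expected) <= rel_tol * abs(expected)


def bmi(weight_kg, height_m):
    """Body mass index in kg/m^2."""
    return weight_kg / height_m ** 2


# Toy instance written for this example (not taken from MedCalc-Bench).
example = CalcInstance(
    patient_note="Toy note: adult patient, weight 70 kg, height 1.75 m.",
    question="What is the patient's body mass index (BMI) in kg/m^2?",
    final_answer=bmi(70, 1.75),
    explanation="BMI = weight / height^2 = 70 / 1.75^2",
)

print(is_correct(22.9, example.final_answer))  # close to the reference value
print(is_correct(25.0, example.final_answer))  # outside the tolerance
```

Comparing with a tolerance rather than exact equality avoids penalizing a model for harmless rounding differences in an otherwise correct calculation.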

6 AI Medical Chatbot Medical Conversation Dataset

Estimated size: 118.35 MB

Download address: https://go.hyper.ai/W5OnS

This is an experimental dataset designed for running medical chatbots, which contains 256,916 conversations between patients and doctors.

7 TCGA-ESCA Cancer CT Imaging

Estimated size: 3.79 GB

Download address: https://go.hyper.ai/eJWQt

TCGA-ESCA Cancer CT Images is an esophageal cancer dataset published by the GDC Data Portal. It contains 5,271 data files from 185 people. The dataset aims to digitally track the entire cancer diagnosis and treatment process, recording examination results, prescriptions, and treatment efficacy as digital archives.

8 TCGA-KICH Cancer CT Imaging 

Estimated size: 1.62 GB

Download address: https://go.hyper.ai/iVUWA

TCGA-KICH Cancer CT Images is a dataset related to adenoma and adenocarcinoma, published by the GDC Data Portal. It contains 2,325 data files from 113 people. The dataset aims to digitally track the entire cancer diagnosis and treatment process, recording examination results, prescriptions, and treatment efficacy as digital archives.

9 Cancer CT image data 

Estimated size: 367.88 MB

Download address: https://go.hyper.ai/tsMh5

This dataset, whose full title is "CT Medical Image Analysis Tutorial: CT images from cancer imaging archive with contrast and patient age", is a cancer CT image dataset released on Kaggle in 2016. The related paper is "Radiology Data from The Cancer Genome Atlas Lung Adenocarcinoma [TCGA-LUAD] collection".

It contains 475 CT images from 69 patients and is intended for examining and comparing the association between patient age and CT image data. It is part of the TCGA-LUAD lung cancer CT image database.

10 MURA bone X-ray dataset 

Estimated size: 6.74 GB

Download address: https://go.hyper.ai/DlGYH

MURA is a large bone X-ray dataset whose task is to determine from an X-ray whether the bones shown are normal or abnormal. The dataset was released by Stanford University in 2017. The related paper is "MURA: Large Dataset for Abnormality Detection in Musculoskeletal Radiographs".

The publishers hope that the dataset will lead to significant advances in medical imaging techniques capable of expert-level diagnosis, improving health care in areas with a limited number of radiologists.