Breast Cancer: Multi-Modal Fusion Dataset
License
CC BY 4.0
Breast Cancer: Multi-Modal Fusion is a preprocessed multimodal dataset of patients with invasive breast cancer (BRCA). It aims to provide a plug-and-play foundation for building multimodal fusion networks and is suited to research scenarios such as multimodal fusion modeling, radiomics, survival prediction, and personalized treatment analysis. The dataset aligns multi-source data from 122 BRCA patients. All samples are mapped across modalities by TCGA Case ID, giving a one-to-one correspondence between macroscopic medical imaging (MRI), microscopic digital pathology (histopathology), multi-omics, and clinical treatment information. The data is organized as CSV files, pathological patch images, and mapping files.
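The alignment by TCGA Case ID can be checked before any modeling. The following is a minimal sketch, not part of the dataset's own tooling: it relies on the general TCGA convention that the first 12 characters of a barcode identify the case, and the helper names (`to_case_id`, `check_alignment`) and the toy identifiers are hypothetical.

```python
import re

# TCGA sample/aliquot barcodes start with a 12-character case-level prefix,
# e.g. "TCGA-A1-A0SB-01A-11R-A144-07" -> "TCGA-A1-A0SB".
CASE_ID_PATTERN = re.compile(r"^TCGA-[A-Z0-9]{2}-[A-Z0-9]{4}")


def to_case_id(barcode: str) -> str:
    """Reduce any TCGA barcode to its case-level ID; raise if it is malformed."""
    match = CASE_ID_PATTERN.match(barcode.strip().upper())
    if match is None:
        raise ValueError(f"not a TCGA barcode: {barcode!r}")
    return match.group(0)


def check_alignment(ids_by_modality):
    """Return, per modality, the case IDs it lacks relative to the union of all modalities."""
    case_sets = {
        name: {to_case_id(b) for b in ids} for name, ids in ids_by_modality.items()
    }
    all_cases = set().union(*case_sets.values())
    return {name: all_cases - cases for name, cases in case_sets.items()}


# Toy identifiers for illustration only (not taken from the dataset).
example = {
    "mri": ["TCGA-AA-0001", "TCGA-AA-0002"],
    "rna": ["TCGA-AA-0001-01A-11R-0000-07", "TCGA-AA-0002-01A-11R-0000-07"],
    "clinical": ["TCGA-AA-0001"],
}
missing = check_alignment(example)
print(missing)
```

An empty set for every modality means all modalities cover the same cases. In the toy input the clinical table lacks one case, so it would be reported there.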
Data composition
Vision Modality
- MRI scan (mri_processed): Preprocessed breast MRI images used to study tumor structure and imaging features.
- Histopathological slides (SVS_patches): High-resolution pathological slide patches extracted from Whole Slide Images (WSIs), which can be used directly to train vision models such as CNNs and ViT.
- Tissue mapping file (MRI_and_SVS_Patches_index.json): Maps pathological patches to patients, which simplifies building PyTorch or TensorFlow data loaders.
Multi-Omics
- Transcriptomics (RNA_RAW.csv): Standardized RNA-Seq gene expression data
- Copy number variants (CNV_RAW.csv): Amplification and deletion characteristics of copy number variants (CNVs)
- Fusion omics features (RNA_CNV_ModelReady.csv): A standardized feature file containing RNA and CNV data, which can be directly used as input to a neural network.
- Somatic mutation data (Mutations_Dataset.csv): A list of somatically mutated genes, aggregated by patient.
Clinical & Treatment Data
- Clinical treatment data (Clinical_Treatment_Data.csv): Cleaned clinical and treatment data file
- Clinical fields include demographic information, survival status (vital_status), and TNM pathological stage.
- Drug encoding matrix: Provides one-hot encoded features for drugs such as Drug_Tamoxifen and Drug_Paclitaxel, used to analyze the correlation between treatment regimens and patient prognosis.
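As an illustration of how the CSV files above might be combined, the sketch below joins an omics table with the clinical table on the case ID and counts outcomes per treatment flag. It is an assumption-laden example: the ID column name `case_id`, the feature column `gene_a`, the `Alive`/`Dead` values, and all rows are hypothetical stand-ins, so the real headers and encodings of RNA_CNV_ModelReady.csv and Clinical_Treatment_Data.csv should be checked first. Only `vital_status`, `Drug_Tamoxifen`, and `Drug_Paclitaxel` are names taken from the description.

```python
import csv
import io
from collections import Counter

ID_COL = "case_id"  # assumed name of the TCGA Case ID column


def read_by_case(handle, id_col=ID_COL):
    """Index CSV rows by case ID, rejecting duplicates so there is one row per patient."""
    rows = {}
    for row in csv.DictReader(handle):
        case = row[id_col]
        if case in rows:
            raise ValueError(f"duplicate case ID: {case}")
        rows[case] = row
    return rows


def join_on_case(omics, clinical):
    """Keep only cases present in both tables and merge their columns."""
    shared = sorted(omics.keys() & clinical.keys())
    return [{**omics[case], **clinical[case]} for case in shared]


# Toy stand-ins for the real files; with the dataset on disk these would be
# open("RNA_CNV_ModelReady.csv", newline="") and similar.
OMICS_CSV = """case_id,gene_a
TCGA-AA-0001,0.1
TCGA-AA-0002,-1.2
TCGA-AA-0003,0.7
"""
CLINICAL_CSV = """case_id,vital_status,Drug_Tamoxifen,Drug_Paclitaxel
TCGA-AA-0001,Alive,1,0
TCGA-AA-0002,Dead,0,1
TCGA-AA-0003,Alive,1,1
"""

omics = read_by_case(io.StringIO(OMICS_CSV))
clinical = read_by_case(io.StringIO(CLINICAL_CSV))
table = join_on_case(omics, clinical)

# Count (treatment flag, outcome) pairs as a first look at regimen vs. prognosis.
# csv.DictReader yields strings, so the flags are "0"/"1" here.
counts = Counter((row["Drug_Tamoxifen"], row["vital_status"]) for row in table)
print(counts)
```

The sketch uses only the standard library so it runs without extra dependencies. For the full files, a DataFrame join on the same key would be the more common choice.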
Citation
The Cancer Genome Atlas Breast Invasive Carcinoma (TCGA-BRCA) Data Collection. Genomic and clinical data retrieved from the GDC Data Portal, belonging to the TCGA-BRCA project.