HyperAIHyperAI

Command Palette

Search for a command to run...

il y a 3 ans

Prédiction de la maladie coronarienne à l'aide d'analyses sanguines de routine

Ning Meng Peng Zhang Junfeng Li Jun He Jin Zhu

Prédiction de la maladie coronarienne

20 heures de calcul sur RTX 5090 pour seulement $1 (valeur $7)
Aller à Notebook

Résumé

L’objectif de cette étude était d’examiner l’association entre les résultats des analyses sanguines de routine et le risque de maladie coronarienne (MC), de les intégrer dans des modèles de prédiction coronarienne et de comparer les propriétés de discrimination de cette approche avec d’autres fonctions de prédiction. Cette recherche a été conçue comme une étude rétrospective, monocentrique, portant sur une cohorte hospitalière. Les 5 060 patients atteints de MC (2 365 hommes et 2 695 femmes) avaient entre 1 et 97 ans au moment de l’inclusion, avec 8 ans (2009–2017) de dossiers médicaux, 5 051 bilans de santé et 5 075 cas d’autres maladies. Nous avons développé un modèle d’Arbres de Décision par Gradient Boosting (GBDT) à deux couches, basé sur les données sanguines de routine, afin de prédire le risque de maladie coronarienne, permettant d’identifier 86 % des personnes atteintes de cette pathologie. Nous avons constitué un jeu de données composé de 15 000 résultats d’analyses sanguines de routine. À l’aide de ce jeu de données, nous avons entraîné le modèle GBDT à deux couches pour classifier l’état de santé, la maladie coronarienne et d’autres maladies. Suite à la classification obtenue par apprentissage automatique, nous avons constaté que la sensibilité de détection des données de santé était d’environ 93 % pour l’ensemble des données, et que la sensibilité de détection de la MC était de 93 % pour les données pathologiques incluant la maladie coronarienne. Sur cette base, nous avons ensuite visualisé la corrélation entre les résultats des analyses sanguines de routine et les éléments de données associés, révélant un motif évident concernant l’état de santé et la maladie coronarienne dans toutes les présentations de données, ce qui peut servir de référence clinique. Enfin, nous avons brièvement analysé les résultats ci-dessus du point de vue de la physiopathologie. Les données sanguines de routine fournissent plus d’informations sur la MC que ce qui était déjà connu grâce à la corrélation entre les résultats des tests et les éléments de données associés. Un modèle simple de prédiction de la maladie coronarienne a été développé à l’aide d’un algorithme GBDT, permettant aux médecins de prédire le risque de MC chez les patients ne présentant pas de MC manifeste.

One-sentence Summary

Drawing on 15,000 routine blood test results, this study develops a two-layer Gradient Boosting Decision Tree (GBDT) model that classifies healthy status, coronary heart disease, and other conditions with approximately 93% sensitivity, demonstrating the clinical utility of routine blood markers for CHD risk prediction.

Key Contributions

  • A two-layer Gradient Boosting Decision Tree (GBDT) model is developed using a dataset of 15,000 routine blood test records to classify healthy status, coronary heart disease, and other diseases. This framework enables early risk stratification by processing standard hematological markers.
  • The algorithm achieves approximately 93% sensitivity for general health classification and identifies 86% of coronary heart disease cases within the evaluated cohorts. These performance metrics demonstrate the predictive utility of routinely collected laboratory data.
  • Correlations between specific hematological indices, including platelet distribution width and red cell distribution width, and coronary pathology are visualized and analyzed. This investigation reveals distinct physiological patterns that differentiate healthy cohorts from diseased patients and supports pathophysiological interpretation.

Introduction

Coronary heart disease imposes a significant global health burden, yet early detection remains difficult because conventional diagnostics like angiography only reveal advanced pathology. Existing clinical risk scores frequently fail to identify high-risk individuals, particularly younger patients, due to their dependence on specialized lipid panels, clinical assessments, or genomic testing that are often costly or inaccessible. To bridge this gap, the authors leverage widely available routine blood test data to train a two-layer Gradient Boosting Decision Tree model that classifies health status and predicts coronary heart disease risk with approximately 93 percent sensitivity. By correlating specific hematological markers with underlying pathophysiological mechanisms such as chronic hypoxia, systemic inflammation, and coagulation dysregulation, the team delivers a low-cost, automated screening framework that empowers clinicians to initiate preventative interventions earlier.

Dataset

  • Dataset Composition and Sources: The authors compiled clinical records from 16,860 patients enrolled across eastern China, extracting information from outpatient systems, inpatient examination logs, and routine health check databases. The cohort was initially divided into three diagnostic categories: coronary heart disease, other diseases, and a healthy population.
  • Subset Details and Labeling: The initial breakdown included 5,060 CHD patients, 5,075 patients with other diseases, and the remaining healthy individuals. Original labels assigned a value of 1 to CHD, -1 to other diseases, and 0 to healthy subjects. After quality control steps reduced the pool to 15,033 records, the authors randomly sampled 5,000 cases from each group to establish a strictly balanced 1:1:1 mixture.
  • Data Processing and Feature Engineering: Raw inputs were standardized, including binary conversion of gender categories. High sparsity was addressed by first dropping rows with missing values, then imputing remaining gaps using group-specific averages. Clinical outliers were filtered using domain-specific rules. The final feature matrix combined demographic basics with 22 standardized blood routine indices, such as white blood cell count, hemoglobin levels, platelet parameters, and differential cell percentages.
  • Training Strategy and Model Usage: The authors structured the analysis into a two-layer classification pipeline. The first layer separated healthy subjects from all diseased patients using a 10,000 record subset split into 70 percent training and 30 percent validation. The second layer isolated CHD from other diseases by removing healthy records and applying the same 70/30 partition to the remaining 10,000 cases. The balanced data was evaluated across logistic regression, support vector machines, and gradient boosting decision trees, with GBDT selected as the optimal model following grid search tuning (learning rate of 0.23 and 70 estimators).

Method

The authors leverage a two-layer Gradient Boosting Decision Tree (GBDT) classification model to perform low-cost risk assessment for coronary heart disease (CHD) using blood routine data. The model is designed to classify patients into high-risk and low-risk categories based on a large dataset of clinical cases. The overall framework consists of two sequential stages, where the first layer performs an initial classification, and the second layer refines the prediction using a subset of features, thereby enhancing the model's discriminative capability. Each layer is composed of multiple decision trees, with the construction of each tree involving feature selection based on a specified metric. Features that are selected closer to the root node of the tree and split more frequently are considered more important, reflecting their higher contribution to classification accuracy.


As shown in the figure below, the authors conduct a correlation analysis using the selected features with the highest contribution in both layers—LY%, HCT, RBC, age, RDW, BASO%, and LY—based on feature importance rankings. The scatter plot illustrates the aggregation effect of healthy individuals and CHD patients, with blue dots representing the healthy population and red dots representing CHD patients. The separation between the two groups is evident, indicating that the selected features effectively capture distinguishing patterns between the populations. The visualization also reveals that the healthy population exhibits a tighter aggregation, which aligns with the model's higher recall rate for healthy individuals (91%) compared to CHD patients (86.5%).


Refer to the framework diagram, which presents a visual representation of the relationships among all features in the blood routine data. The authors connect the features with dashed lines to illustrate the complex interdependencies and clustering patterns. The plot distinguishes between healthy individuals (green), CHD patients (red), and other diseases (blue), with clear separation between green and red clusters. This confirms the model's classification performance and provides insight into the underlying data structure, demonstrating that the selected features form distinct groupings that support the model’s predictive decisions.

Experiment

The evaluation employs a two-layer gradient boosting decision tree model trained on routine blood test data to classify individuals into healthy, coronary heart disease, and other disease categories. This experimental design validates both the predictive equivalence of hierarchical versus direct classification and the interpretability of clinical biomarkers by systematically mapping data associations to underlying physiological mechanisms. Qualitative analysis confirms that the framework effectively distinguishes disease states while revealing clear, clinically relevant patterns that align with known pathophysiological processes. Ultimately, the study concludes that routine blood tests provide sufficient structural information to support accurate risk stratification and meaningful clinical interpretation.

The authors compare the performance of different machine learning algorithms, including logistic regression, SVM, and GBDT, using a dataset of routine blood test data. Results show that the GBDT algorithm achieves higher accuracy, sensitivity, and specificity compared to the other two models. GBDT outperforms LR and SVM in terms of accuracy, sensitivity, and specificity. The GBDT algorithm demonstrates the highest sensitivity among the compared models. SVM shows better performance than LR across all evaluation metrics.

{"summary": "The authors developed a two-layer GBDT model to classify health, coronary heart disease, and other diseases using routine blood test data. The model achieved high precision and recall in both layers, with the first layer distinguishing healthy from diseased individuals and the second layer further classifying diseased cases into coronary heart disease and other conditions. Results show that the two-layer approach performs comparably to a direct three-classification model while providing interpretability for clinical use.", "highlights": ["The two-layer GBDT model achieves high precision and recall in distinguishing healthy individuals from diseased people in the first layer.", "In the second layer, the model maintains strong performance in identifying coronary heart disease and other diseases.", "The model's classification results are consistent with the overall performance reported in the study, supporting its use for clinical risk prediction."]

The authors compare multiple machine learning algorithms for classifying health, coronary heart disease, and other diseases using routine blood test data. The GBDT model achieves the highest accuracy and sensitivity among the compared models, indicating strong performance in identifying both healthy individuals and those with coronary heart disease. The GBDT model outperforms LR and SVM in both accuracy and sensitivity. The GBDT model achieves high sensitivity for identifying both healthy individuals and those with coronary heart disease. The two-layer GBDT model is effective in classifying health, coronary heart disease, and other diseases using routine blood test data.

The authors developed a two-layer GBDT model using routine blood test data to predict coronary heart disease, achieving comparable performance to a three-classification model. Results show that the model identifies individuals with coronary heart disease with high precision and recall, and the approach reveals consistent patterns in blood test data that may have clinical relevance. The model also demonstrates strong sensitivity for both health and disease categories. The two-layer GBDT model achieves high precision and recall for coronary heart disease detection. The model demonstrates strong sensitivity in identifying both healthy individuals and those with coronary heart disease. The approach reveals consistent patterns in routine blood test data that could be useful for clinical reference.

The experiments compare multiple machine learning algorithms using routine blood test data to classify health status and specific cardiovascular conditions. These evaluations validate the effectiveness of a hierarchical two-layer GBDT architecture against a direct multi-class approach, emphasizing classification reliability and clinical interpretability. The results consistently indicate that GBDT substantially outperforms logistic regression and SVM across all performance dimensions. Ultimately, the two-layer framework proves highly effective for disease stratification, matching the accuracy of direct classification while delivering transparent, clinically actionable insights for routine health screening.


Créer de l'IA avec l'IA

De l'idée au lancement — accélérez votre développement IA avec le co-codage IA gratuit, un environnement prêt à l'emploi et le meilleur prix pour les GPU.

Codage assisté par IA
GPU prêts à l’emploi
Tarifs les plus avantageux

HyperAI Newsletters

Abonnez-vous à nos dernières mises à jour
Nous vous enverrons les dernières mises à jour de la semaine dans votre boîte de réception à neuf heures chaque lundi matin
Propulsé par MailChimp