HyperAIHyperAI

Command Palette

Search for a command to run...

TRIAGE : Raisonnement dialectique pour la prédiction de risque explicable sur des séries temporelles médicales à échantillonnage irrégulier avec des LLMs

Hyeongwon Jang Gyouk Chu Changhun Kim Joonhyung Park Hangyul Yoon Eunho Yang

Résumé

Les systèmes d'alerte précoce clinique, fondés sur les dossiers de santé électroniques où les observations cliniques sont enregistrées sous forme de séries temporelles médicales échantillonnées de manière irrégulière (ISMTS), doivent fournir à la fois des scores de risque calibrés pour le triage des patients et des justifications interprétables que les cliniciens peuvent vérifier. Les Modèles de Langage de Grande Taille (LLM) ont été explorés pour cette tâche, mais ils réduisent le risque clinique gradué à des prédictions binaires surconfiantes. Cette polarisation des risques compromet à la fois la calibration et la comparabilité inter-patients. Pour remédier à ce problème, nous proposons TRIAGE, un cadre qui entraîne un LLM à générer un raisonnement dialectique sur des issues cliniques concurrentes en sollicitant des justifications spécifiques à chaque issue. Cette formulation dialectique atténue la polarisation des risques, permettant à un unique LLM de produire des scores de risque continus fondés sur un raisonnement clinique explicite. Évalué sur trois jeux de référence ISMTS, TRIAGE obtient une amélioration moyenne de l'AUPRC de 3,3 % et réduit l'erreur de calibration de 81 % par rapport aux méthodes de référence compétitives. Une évaluation de type LLM-as-a-judge montre en outre que nos justifications surpassent les explications post-hoc de la méthode de référence de 20 % en termes de qualité du raisonnement clinique. Le code source est disponible à l'adresse https://github.com/HyeongWon-Jang/TRIAGE .

One-sentence Summary

The authors propose TRIAGE, a framework that trains large language models to generate dialectical reasoning over competing clinical outcomes, yielding continuous, calibrated risk scores and interpretable rationales for irregularly sampled medical time series that achieve an average AUPRC improvement of 3.3% and reduce calibration error by 81% across three benchmarks while surpassing baseline post-hoc explanations by 20% in clinical reasoning quality.

Key Contributions

  • TRIAGE is a framework that trains large language models to generate dialectical reasoning over competing clinical outcomes for irregularly sampled medical time series. By eliciting outcome-specific rationales instead of forcing a single prediction, the method prevents probability collapse and yields continuous, cross-patient comparable risk scores.
  • The approach implements a two-stage training pipeline comprising dialectical reasoning supervision and self-refinement to align implicit probability distributions with patient evidence. This formulation enables the model to weigh coexisting clinical signals of deterioration and stabilization during a single forward pass.
  • Evaluated on three irregularly sampled time series benchmarks, the framework achieves an average 3.3% improvement in AUPRC and reduces calibration error by 81% relative to competitive baselines. LLM-as-a-judge evaluations further demonstrate that the generated rationales surpass post-hoc explanations by 20% in clinical reasoning quality.

Introduction

Clinical early warning systems leverage irregularly sampled medical time series from electronic health records to predict adverse events, necessitating both well-calibrated risk scores for patient triage and interpretable rationales that clinicians can verify. Existing solutions fail to deliver these capabilities jointly, as specialized deep learning models remain opaque and post-hoc explainability methods provide only non-linguistic attribution scores rather than high-level clinical reasoning. LLM-based approaches face similar trade-offs, often sacrificing natural language explanations for continuous risk scoring or suffering from risk polarization where elicited reasoning pre-commits to a single outcome, collapsing probability distributions into overconfident binary predictions. The authors propose TRIAGE, a framework that trains an LLM to generate dialectical reasoning over competing clinical outcomes by producing outcome-specific rationales. This method mitigates risk polarization and allows a single model to yield continuous, cross-patient comparable risk scores grounded in explicit clinical reasoning via a two-stage pipeline of dialectical reasoning supervision and self-refinement.

Dataset

Experiment

Evaluated across three irregular medical time series benchmarks for sepsis and mortality prediction, the experiments validate TRIAGE’s predictive accuracy, calibration, and robustness against missing clinical variables. Ablation studies demonstrate that dialectical reasoning and batch-level reinforcement learning significantly outperform traditional single-outcome supervision and sample-level rewards, effectively mitigating risk saturation and improving score reliability. Qualitative assessments further confirm that the model generates clinically grounded, temporally aware rationales that avoid the one-sided biases and hallucinations prevalent in standard LLMs and post-hoc interpretability methods. Overall, the findings establish TRIAGE as a robust, data-efficient framework that successfully bridges high-fidelity clinical reasoning with reliable risk prediction.

The authors evaluate TRIAGE against established ISMTS baselines on the P12 dataset under varying training data fractions. Results indicate that TRIAGE significantly outperforms existing methods when labeled data is scarce, demonstrating high data efficiency. As data availability increases, TRIAGE remains competitive, achieving top-tier performance across most evaluation criteria. TRIAGE significantly outperforms baselines in low-data regimes, demonstrating high data efficiency. The method consistently achieves the highest AUROC scores across all evaluated data fractions. TRIAGE maintains competitive performance relative to top baselines as training data availability increases.

The authors assess the robustness of their model, TRIAGE, against established baselines by randomly masking input variables during inference. The experiment reveals that TRIAGE preserves its predictive capability more effectively than competing methods as the proportion of missing data increases. The reinforcement learning-tuned variant demonstrates particular strength in maintaining high discrimination metrics under severe data scarcity. TRIAGE demonstrates greater stability than baselines as the percentage of missing variables rises from 10% to 50%. The RL-tuned TRIAGE variant secures the top performance rank for AUPRC at higher missingness levels. While baselines lead at lower removal rates, TRIAGE consistently remains in the top tier of performers across all tested scenarios.

The authors evaluate their proposed TRIAGE model against several ISMTS baselines under varying variable masking ratios on the MIMIC-III dataset. Results demonstrate that both the supervised fine-tuned and reinforcement learning-enhanced versions of TRIAGE consistently achieve superior discrimination performance compared to existing methods across all tested missing data scenarios. The model maintains robust predictive accuracy even as the proportion of masked variables increases, highlighting its resilience to incomplete clinical information. The proposed TRIAGE model, particularly after reinforcement learning, consistently outperforms all ISMTS baselines in both AUROC and AUPRC metrics across all variable masking ratios. TRIAGE demonstrates superior robustness to missing data, maintaining high predictive performance even when up to half of the input variables are removed. The reinforcement learning stage provides a notable performance boost over supervised fine-tuning alone, achieving the best results on nearly all evaluation points.

The authors evaluate model performance across varying training data sizes on the P12 dataset. Results demonstrate that all methods benefit from increased data availability, with performance gains evident as the training set expands. The proposed method consistently achieves the highest discrimination scores across all data fractions, showing the most significant advantage in low-resource settings. The proposed method achieves the highest AUROC and AUPRC scores across all data sizes, with the most significant advantage at the lowest training fraction. Performance for all models improves as the training data size increases. The method maintains a clear lead over baselines even as data availability grows, though the margin narrows slightly at higher data fractions.

The authors analyze the effectiveness of various reasoning structures by comparing zero-shot inference with supervised fine-tuning baselines. The results demonstrate that the proposed TRIAGE_SFT approach delivers the highest performance in discrimination metrics compared to all other tested methods. TRIAGE_SFT achieves superior performance in discrimination metrics relative to zero-shot inference and other supervised baselines. One-sided rationale supervision results in lower predictive accuracy compared to answer-only supervised fine-tuning. Zero-shot inference exhibits the weakest performance among the evaluated reasoning strategies.

The authors evaluate TRIAGE against established baselines on the P12 and MIMIC-III datasets by testing performance across varying training data fractions, increasing input variable masking ratios, and distinct reasoning strategies. These experiments validate the model's high data efficiency in low-resource settings, its predictive stability under severe missing data conditions, and the effectiveness of reinforcement learning over zero-shot or standard supervised approaches. Overall, TRIAGE demonstrates superior resilience to training scarcity and incomplete clinical information while maintaining consistent top-tier performance across all tested scenarios.


Créer de l'IA avec l'IA

De l'idée au lancement — accélérez votre développement IA avec le co-codage IA gratuit, un environnement prêt à l'emploi et le meilleur prix pour les GPU.

Codage assisté par IA
GPU prêts à l’emploi
Tarifs les plus avantageux

HyperAI Newsletters

Abonnez-vous à nos dernières mises à jour
Nous vous enverrons les dernières mises à jour de la semaine dans votre boîte de réception à neuf heures chaque lundi matin
Propulsé par MailChimp