A Real-World Evaluation of Large Language Model (LLM) Medication Safety Reviews in NHS Primary Care
Oliver Normand Esther Borsi Mitch Fruin Lauren E Walker Jamie Heagerty Chris C. Holmes Anthony J Avery Iain E Buchan Harry Coppock
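Abstract
Large language models (LLMs) often match or exceed clinician-level performance on medical benchmarks, yet very few are evaluated on real clinical data or examined beyond headline metrics. We present, to our knowledge, the first evaluation of an LLM-based medication safety review system on real NHS primary care data, with a detailed characterization of key failure behaviours across varying levels of clinical complexity. In a retrospective study using a population-scale electronic health record (EHR) dataset covering 2,125,549 adults in NHS Cheshire and Merseyside, we strategically sampled patients to capture a broad range of clinical complexity and medication safety risk, yielding 277 patients after data-quality exclusions. An expert clinician reviewed each patient and graded the issues the system identified and the interventions it proposed. The primary LLM system performed strongly at recognizing when a clinical issue was present (sensitivity 100% [95% CI 98.2–100], specificity 83.1% [95% CI 72.7–90.1]), yet it correctly identified all issues and interventions in only 46.9% [95% CI 41.1–52.8] of patients. Failure analysis showed that, in this setting, the dominant failure mechanism was a lack of contextual reasoning rather than missing medication knowledge, with five primary patterns: overconfidence in uncertainty, applying standard guidelines without adjusting for patient context, misunderstanding how healthcare is delivered in practice, factual errors, and process blindness. These patterns persisted regardless of patient complexity and demographic characteristics, and across a range of state-of-the-art models and configurations. We provide 45 detailed vignettes that comprehensively cover all identified failure cases. This work highlights limitations that must be addressed before LLM-based clinical AI can be safely deployed, and it points to an urgent need for large-scale prospective evaluation and deeper study of LLM behaviour in clinical settings.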
One-sentence Summary
Researchers from i.AI/DSIT, the University of Liverpool, and colleagues present the first real-world NHS evaluation of an LLM medication safety system, finding that contextual reasoning failures (particularly overconfidence and guideline rigidity) were the dominant limitation across 277 patient cases despite 100% sensitivity, and revealing safety gaps that must be resolved before clinical deployment.
Key Contributions
- This study addresses the critical gap in evaluating large language models (LLMs) on real-world clinical data, as most prior research relies on synthetic benchmarks despite LLMs achieving clinician-level scores in controlled settings. It presents the first evaluation of an LLM-based medication safety review system using actual NHS primary care records, focusing on detailed failure analysis beyond headline metrics.
- The authors introduce a hierarchical evaluation framework and apply it to a population-scale EHR dataset of 2,125,549 adults, strategically sampling patients across a range of clinical complexity and retaining 277 for expert clinician review to characterize failure modes in real-world clinical scenarios. Their analysis shows that contextual reasoning failures, not missing medical knowledge, dominate errors, with five persistent patterns: overconfidence in uncertainty, rigid guideline application, misunderstanding of healthcare delivery, factual errors, and process blindness.
- Evidence from the retrospective study shows the primary LLM system achieved high sensitivity (100%) but correctly identified all issues and interventions in only 46.9% of patients, with failure patterns consistent across patient complexity, demographics, and multiple state-of-the-art models. The work provides 45 detailed clinical vignettes documenting these failures, demonstrating that current LLMs cannot reliably handle nuanced clinical reasoning required for safe deployment.
Introduction
Medication safety reviews are critical in primary care because of the high global burden of preventable harm from prescribing errors, which cost the NHS up to £1.6 billion annually and contribute to 8% of hospital admissions. Large language models (LLMs) show promise in matching clinician-level performance on medical benchmarks, but prior evaluations have significant gaps: most studies rely on synthetic data or exam-style questions rather than real patient records, automated scoring often misses clinically significant errors such as hallucinations, and failure modes remain poorly characterized in complex clinical workflows. The authors address this by evaluating LLMs on actual NHS primary care records and introducing a three-level hierarchical framework to dissect failures. They report that contextual reasoning flaws, such as misjudging temporal relevance or patient-specific factors, outnumber factual inaccuracies by roughly 6:1 across the 148 patients with failures, revealing a gap between theoretical model competence and safe clinical deployment.
Dataset
The authors use electronic health records from NHS Cheshire and Merseyside's Trusted Research Environment (GraphNet Ltd.), covering 2,125,549 unique adults aged 18+ with structured SNOMED CT and dm+d coded data. Key details include:
- Primary dataset: Longitudinal patient profiles with GP events, medications, hospital episodes, and clinical observations (no free-text notes). Mean 1,010 GP events per patient; dates span 1976–2025. Population shows higher deprivation (26.6% in England’s most deprived decile vs. 7.7% least deprived).
- Evaluation subset: 300 patients sampled from a 200,000-patient test set, reduced to 277 after excluding 23 cases with data issues. Sampling included:
  - 100 patients with prescribing safety indicators (stratified across 10 indicators)
  - 100 indicator-negative patients matched on age, sex, prescriptions, and recent GP events
  - 50 System-positive and 50 System-negative cases from indicator-negative populations
The authors processed raw parquet files into structured Pydantic patient profiles, converting them to chronologically ordered markdown for analysis. This included demographics, diagnoses, active/past medications, lab results, and clinical observations. For evaluation, clinician feedback on the 277 cases was synthesized into ground truth (validated clinical issues and interventions) using gpt-oss-120b. Data governance followed NHS Cheshire & Merseyside approvals, with pseudonymized profiles analyzed under strict TRE protocols (June–November 2025).
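The profile-to-markdown step can be sketched as follows. This is a minimal illustration, not the authors' code: it uses standard-library dataclasses as a stand-in for the Pydantic models, and the class names, field names, and example events are all hypothetical.

```python
from dataclasses import dataclass, field
from datetime import date


@dataclass
class ClinicalEvent:
    """One coded entry in the record (hypothetical shape)."""
    when: date
    kind: str          # e.g. "diagnosis", "medication", "lab"
    description: str   # readable term for the SNOMED CT or dm+d code


@dataclass
class PatientProfile:
    """Stand-in for the structured patient profile (assumed fields)."""
    age: int
    sex: str
    events: list[ClinicalEvent] = field(default_factory=list)


def profile_to_markdown(profile: PatientProfile) -> str:
    """Render a profile as markdown with events in chronological order."""
    lines = [
        "# Patient profile",
        f"- Age: {profile.age}",
        f"- Sex: {profile.sex}",
        "",
        "## Timeline",
    ]
    for event in sorted(profile.events, key=lambda e: e.when):
        lines.append(f"- {event.when.isoformat()} [{event.kind}] {event.description}")
    return "\n".join(lines)


# Invented example data, deliberately given out of order.
example = PatientProfile(
    age=72,
    sex="F",
    events=[
        ClinicalEvent(date(2024, 3, 1), "medication", "Diltiazem 60 mg tablets"),
        ClinicalEvent(date(2019, 6, 12), "diagnosis", "Heart failure"),
    ],
)
markdown = profile_to_markdown(example)
```

Sorting at render time keeps the stored profile independent of the presentation order, so the same profile could feed differently ordered prompts.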
Method
The authors leverage a structured, three-stage pipeline to evaluate medication safety systems, integrating automated analysis with clinician validation. The process begins with longitudinal patient profiles extracted from electronic health records (EHRs), which include demographics, QOF registers, clinical events, and active medications. These profiles are formatted into a standardized patient prompt and fed into the system, which is instructed to flag safety issues and propose interventions. The system’s output comprises a summary, a list of clinical issues with supporting evidence, and a specific intervention plan if warranted.
Refer to the framework diagram, which illustrates the end-to-end workflow: from raw EHR-derived patient data through system analysis to clinician review. The system’s output is then manually evaluated by clinicians who assess both the presence and correctness of flagged issues and the appropriateness of proposed interventions. This manual review serves as the ground truth for subsequent automated scoring.
The evaluation employs a three-level hierarchical framework. Level 1 assesses whether the system correctly identifies any issue when one exists. Level 2, conditional on Level 1 success, evaluates whether the system correctly identifies all relevant issues. Level 3, conditional on Level 2, determines whether the proposed intervention directly resolves the identified safety concern. This tiered approach enables granular diagnosis of failure points, revealing where and why the system breaks down.
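One way to read the conditional structure of the three levels is sketched below. The record shape and field names are hypothetical, and the sketch covers only patients for whom the clinician confirmed an issue; it is an interpretation of the description above, not the authors' implementation.

```python
from dataclasses import dataclass


@dataclass
class ReviewedCase:
    """Clinician judgement for one patient (hypothetical field names)."""
    issue_present: bool          # clinician: an issue needing intervention exists
    system_flagged: bool         # system: flagged at least one issue
    all_issues_correct: bool     # clinician: the system's issue list is right
    intervention_correct: bool   # clinician: the proposed intervention is right


def highest_level_passed(case: ReviewedCase) -> int:
    """Return 0-3: the deepest level the system passed for this case."""
    if not (case.issue_present and case.system_flagged):
        return 0
    if not case.all_issues_correct:
        return 1
    if not case.intervention_correct:
        return 2
    return 3


def conditional_pass_rates(cases):
    """Pass rate at each level, conditional on passing the previous level."""
    positives = [c for c in cases if c.issue_present]
    levels = [highest_level_passed(c) for c in positives]
    rates = {}
    eligible = len(levels)
    for level in (1, 2, 3):
        passed = sum(1 for reached in levels if reached >= level)
        rates[level] = passed / eligible if eligible else float("nan")
        eligible = passed  # only cases that passed feed the next level
    return rates
```

Because each rate uses the previous level's passes as its denominator, multiplying the three rates gives the share of issue-positive patients with a fully correct output.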
To scale evaluation across multiple models and configurations, the authors developed an automated scorer, S_automated, which uses clinician-reviewed cases as ground truth. The scorer synthesizes clinician feedback into structured issue and intervention lists using an LLM, then employs a separate LLM judge to compute alignment scores based on F1 metrics for issue identification and intervention appropriateness. True negatives (agreement on absence of issues) receive a score of 1.0; disagreements on issue presence receive 0.0. This scorer enables quantitative comparison across models and analysis of performance variation, such as by patient ethnicity.
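The scoring rules can be illustrated with a small sketch. Two simplifications are assumptions made here: exact matching on normalized issue labels stands in for the LLM judge, and only issue identification is scored, since the text does not say how the issue and intervention F1 values are combined.

```python
def f1(predicted: set[str], reference: set[str]) -> float:
    """F1 overlap between two sets of labels."""
    true_positives = len(predicted & reference)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(reference)
    return 2 * precision * recall / (precision + recall)


def score_case(system_issues: set[str], clinician_issues: set[str]) -> float:
    """Score one patient in [0, 1] using the special cases described above."""
    if not system_issues and not clinician_issues:
        return 1.0  # true negative: both agree there is nothing to fix
    if not system_issues or not clinician_issues:
        return 0.0  # disagreement on whether any issue exists
    return f1(system_issues, clinician_issues)
```

Handling the two special cases before computing F1 avoids a division by zero when either list is empty and gives agreement on "no issue" full credit.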
The system’s output must conform to a strict JSON schema, including fields for patient review, clinical issues (each with issue, evidence, and intervention_required), intervention plan, and intervention probability. Interventions must be specific, actionable, and directly resolve the safety concern—such as “Stop diltiazem”—rather than vague recommendations. The intervention_required flag is set to true only if the issue poses substantial, current, evidence-based risk that can be resolved with a concrete action.
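A validator for this output shape might look like the sketch below. Only `issue`, `evidence`, and `intervention_required` are named in the text; the other field names, the [0, 1] probability range, and the rule that a required intervention needs a non-empty plan are assumptions made for illustration.

```python
import json
from dataclasses import dataclass


@dataclass
class ClinicalIssue:
    issue: str
    evidence: str
    intervention_required: bool


@dataclass
class SystemOutput:
    patient_review: str                 # assumed field name
    clinical_issues: list[ClinicalIssue]
    intervention_plan: str              # assumed field name
    intervention_probability: float     # assumed field name and range


def parse_output(raw: str) -> SystemOutput:
    """Parse one JSON response; raise ValueError if it breaks the schema."""
    data = json.loads(raw)  # json.JSONDecodeError is a ValueError subclass
    try:
        issues = [ClinicalIssue(**item) for item in data["clinical_issues"]]
        output = SystemOutput(
            patient_review=data["patient_review"],
            clinical_issues=issues,
            intervention_plan=data["intervention_plan"],
            intervention_probability=float(data["intervention_probability"]),
        )
    except (KeyError, TypeError) as exc:
        raise ValueError(f"output does not match schema: {exc}") from exc
    if not isinstance(output.intervention_plan, str):
        raise ValueError("intervention_plan must be a string")
    if not all(isinstance(i.intervention_required, bool) for i in issues):
        raise ValueError("intervention_required must be a boolean")
    if not 0.0 <= output.intervention_probability <= 1.0:
        raise ValueError("intervention_probability must lie in [0, 1]")
    needs_action = any(i.intervention_required for i in issues)
    if needs_action and not output.intervention_plan.strip():
        raise ValueError("an issue requires intervention but no plan was given")
    return output


# Invented example response.
raw = json.dumps({
    "patient_review": "Heart failure with a current diltiazem prescription.",
    "clinical_issues": [{
        "issue": "Diltiazem prescribed in heart failure",
        "evidence": "Heart failure recorded; diltiazem active",
        "intervention_required": True,
    }],
    "intervention_plan": "Stop diltiazem",
    "intervention_probability": 0.9,
})
parsed = parse_output(raw)
```

Rejecting malformed responses at parse time keeps schema violations separate from clinical errors, so the clinician review only sees well-formed outputs.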
Failure modes are systematically categorized using a five-category taxonomy derived from clinician review of 178 failure instances across 148 patients. Each failure is also classified by potential clinical harm using the WHO patient safety harm categories, ranging from none to death, to assess real-world impact if the system’s recommendation were implemented without review.
Experiment
- Evaluated gpt-oss-120b LLM system on real NHS primary care data with 277 patients, achieving 100% sensitivity [95% CI 98.2–100] and 83.1% specificity [95% CI 72.7–90.1] for binary intervention detection, but only 46.9% [95% CI 41.1–52.8] fully correct outputs identifying issues and interventions.
- Failure analysis revealed 86% of errors stemmed from contextual reasoning (e.g., overconfidence, protocol misapplication) versus 14% factual errors, with consistent patterns across patient complexity and demographic groups.
- Performance declined with clinical complexity (medication count r = -0.28, p < 0.001), and multi-model comparison showed gpt-oss-120b-medium outperformed smaller/fine-tuned models by 5.6–70.3% in clinician scoring.
- Anchoring bias analysis indicated a 7.9 percentage-point gap between clinician agreement accuracy (95.7%) and the model self-consistency ceiling (87.8%), while ethnicity counterfactual testing showed no significant performance differences across White, Asian, or Black patient profiles (p = 0.976).
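Confidence intervals for proportions like those above can be computed with a Wilson score interval, sketched below. The text does not say which interval method the authors used, so Wilson is an assumption here, and the counts in the usage note are illustrative rather than taken from the source.

```python
from math import sqrt


def wilson_interval(successes: int, total: int, z: float = 1.96):
    """Wilson score interval for a binomial proportion (z = 1.96 for ~95%)."""
    if total <= 0:
        raise ValueError("total must be positive")
    if not 0 <= successes <= total:
        raise ValueError("successes must lie between 0 and total")
    p = successes / total
    denom = 1 + z**2 / total
    centre = (p + z**2 / (2 * total)) / denom
    half = z * sqrt(p * (1 - p) / total + z**2 / (4 * total**2)) / denom
    # Clamp to [0, 1] to absorb floating-point overshoot at the extremes.
    return max(0.0, centre - half), min(1.0, centre + half)


# Illustrative counts only: 59 of 71 is one pair that rounds to 83.1%.
low, high = wilson_interval(59, 71)
```

Unlike the simple normal approximation, the Wilson interval stays informative when the observed proportion is 100%, which matters for a sensitivity estimate with no missed cases.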
The authors use a clinician-reviewed evaluation to assess an LLM-based medication safety system on real NHS data, revealing that while the system achieves 100% sensitivity in flagging cases with issues, it correctly identifies all relevant issues in only 58.7% of those cases and proposes fully appropriate interventions in 58.7% of flagged cases. Results show that the system’s performance degrades across hierarchical evaluation levels, with only 46.9% of all patients receiving fully correct outputs, highlighting a gap between detecting problems and delivering contextually appropriate solutions.

The authors use a stratified sampling approach to evaluate an LLM-based medication safety system on real NHS data, reporting population-level performance metrics derived from a subset of 95 unselected patients. Results show the system achieves 100% sensitivity and 92.3% specificity, with 95.5% overall accuracy and near-perfect negative predictive value, indicating strong detection of true negatives but potential for false positives in real-world deployment.

The authors evaluated clinician agreement with automated prescribing safety indicators across 73 cases, finding 69.9% overall agreement that flagged issues warranted intervention. Agreement varied significantly by indicator type, ranging from 100% for absolute contraindications like diltiazem/verapamil with heart failure to 50% for methotrexate without liver function test monitoring, highlighting that deterministic rules often require contextual clinical judgment.

The authors evaluated an LLM system on 52 clinician-validated prescribing safety indicator cases and found it achieved 100% sensitivity at Level 1, correctly identifying all issues requiring intervention. At Level 2, the system accurately identified the specific indicator in 82.7% of cases, and at Level 3, it proposed appropriate interventions in 76.9% of cases where the issue was correctly identified. These results show strong detection capability but reveal a decline in precision and intervention quality as evaluation depth increases.

The table lists seven prescribing safety indicators with their matched patient counts, prevalence per million, and percentage of time patients met the criteria. Filter 06 (Beta-blocker + asthma) had the highest prevalence at 647 per million, while Filter 28 (NSAID + peptic ulcer) had the lowest at 10 per million. The percentage of time patients matched each filter varied, with Filter 10 (Antipsychotic + dementia) showing the highest continuity at 12.8%, indicating persistent risk, while Filter 33 (Warfarin + antibiotic) showed the lowest at 1.2%, reflecting transient interactions.
