Benchmark Tests AI Clinical Text Understanding in Nine Languages
Researchers at Mass General Brigham have introduced BRIDGE, a multilingual benchmark designed to assess how well large language models interpret real-world clinical text and electronic health records. The study, published in Nature Biomedical Engineering in 2026, addresses a gap in medical AI evaluation by moving beyond standardized examination questions to test models on authentic patient care data.
Traditional medical AI benchmarks have relied predominantly on standardized licensing exam questions, whose formalized language and isolated medical facts fail to capture the complexity of everyday clinical interactions. BRIDGE instead evaluates model performance on actual clinical text drawn from electronic health records, clinical case reports, and patient-doctor consultations across nine languages.
Led by senior author Jie Yang, Ph.D., FACMI, FAMIA, alongside co-senior author Joshua Lin, MD, MPH, ScD, and co-first authors Jiageng Wu and Bowen Gu, the research team systematically tested 95 large language models from 59 distinct providers. The evaluation spanned 14 clinical specialties and covered essential healthcare tasks including patient triage, clinical information extraction, diagnosis, prognosis, and billing coding.
The benchmarking exposed both the capabilities and the limitations of current AI. While top-performing models achieved near-perfect scores on conventional medical licensing exams, their performance dropped substantially on BRIDGE, with the highest scorer attaining only a 44.8 percent success rate on real-world clinical tasks. This discrepancy points to a disconnect between academic medical knowledge and the nuanced, unstructured language prevalent in actual healthcare settings.
Furthermore, the analysis revealed that model accuracy fluctuates considerably across different medical specialties, highlighting the need for domain-specific optimization.
To support ongoing evaluation and transparency, the Mass General Brigham team established a publicly accessible, continuously updated leaderboard that currently tracks the performance of 107 large language models across clinical tasks. This resource empowers healthcare providers to select appropriate AI tools for specific clinical contexts while offering developers actionable feedback to refine model architectures.
Importantly, the inclusion of nine languages allows researchers to pinpoint performance disparities among non-English-speaking populations, directly supporting the development of more equitable and inclusive healthcare AI. By grounding evaluation in authentic, multilingual clinical data, BRIDGE establishes a rigorous new standard for measuring real-world AI readiness in patient care, ultimately guiding both clinical adoption and future technological improvements.
