Generative AI Data Drives Medical Model Innovation
Recently, Professor Sheng Bin of Shanghai Jiao Tong University's School of Computer Science and the Key Laboratory of Artificial Intelligence (Ministry of Education), together with Professor Huang Tianyin of Tsinghua University's School of Medicine, Professor Pearse Keane of the Institute of Ophthalmology at University College London (UCL), and Professor Qin Yuzhong of the National University of Singapore's School of Medicine, published an article titled "Synthetic Data Boosts Medical Foundation Models" in the journal Nature Biomedical Engineering (DOI: https://doi.org/10.1038/s41551-025-01365-0). The article examines the impact of generative artificial intelligence (AI), such as generative adversarial networks (GANs) and diffusion models, on medical foundation models.
Generative AI has significant scientific value: it helps overcome data bottlenecks, facilitates multimodal integration, and advances causal reasoning. In practice, it enhances diagnostic accuracy, accelerates personalized treatment, and optimizes healthcare resource allocation. This innovation is driving a paradigm shift in medicine from data-driven approaches to knowledge-driven ones. It particularly addresses China’s long-standing challenges in the medical data ecosystem, which include stringent privacy protection laws (such as the Data Security Law and the Personal Information Protection Law), high annotation costs (with each medical image requiring several hours of expert labor), and severe data silos (with cross-institutional data sharing rates below 30%). In the current global competition in medical AI, where the focus is on combining foundation models with large models, generative AI offers a breakthrough solution.
By precisely simulating the distribution characteristics of real-world data, generative AI can expand a small dataset of a few thousand cases into a training set of millions, providing essential support for fields where data is scarce, such as rare diseases and pediatrics. The technology also helps build an autonomous and controllable medical data ecosystem in China, addressing critical national needs.
In March, Professor Yan Bo's team from Fudan University published a promising study in Nature Biomedical Engineering that used AI-generated data to construct a foundation model for ophthalmology. In response, Professor Sheng Bin and his collaborators published their commentary. They acknowledged the significant contributions of the Fudan University study but also raised pivotal questions about the role of AI-generated data in building such foundation models.
First, while AI-generated data may reduce the privacy risks associated with real-world medical data, it does not eliminate them entirely. Second, the opaque nature of foundation models makes it difficult to pinpoint the reasons for performance degradation or failure when a model is trained primarily on synthetic data, which leaves the quality or “toxicity” of the AI-generated data uncertain. Third, using limited real-world disease labels to guide the generation of synthetic data can inadvertently reinforce the biases inherent in small datasets, potentially compromising the fairness, equity, and generalizability of the models, especially for rare diseases and underrepresented groups. In addition, the amount of real-world data required to build robust foundation models remains unclear, and the performance of models trained solely on synthetic data is still unknown. It is therefore crucial to establish guidelines and standards that make the provenance of both real-world and synthetic data in medical AI traceable.
The commentary emphasizes that the potential benefits of synthetic data must be balanced with rigorous validation, ethical considerations, and a commitment to improving real-world data collection methods. Moreover, current AI models fall short of capturing the complexity of human health, which encompasses biological, psychological, and environmental factors. Developing a truly universal or world model requires a more comprehensive approach that integrates real-world and synthetic data. Real-world data remains the bedrock, providing rich, authentic information that synthetic data alone cannot fully replicate. Vast territories in human biology and health remain unexplored, such as the mechanisms behind many rare diseases and the relationships between environmental factors and chronic conditions. Making real-world data collection more efficient and more broadly applicable therefore remains a top priority in medical research and AI applications. Synthetic data, for its part, holds promise as a complementary tool that assists in data expansion and model training and thus accelerates research progress.
Looking ahead, the commentary suggests that the application of generative AI in medicine is more than a technological advance; it heralds a transformative change in healthcare services. The technology could significantly advance China’s medical AI capabilities, promoting technological independence and innovation. Given the increasing international competition in medical AI, generative AI presents a significant opportunity to break through technological monopolies and achieve autonomy in key areas such as foundational algorithms. To fully realize this potential, comprehensive institutional frameworks must be developed to smooth the transition of generative AI from experimental breakthroughs to widespread adoption.
Only then can this technology make a sustained contribution to the strategic goals of improving healthcare, benefiting the broader population, and raising China’s overall medical standards.