Synthetic Medical Data: Promising Advances and Critical Risks in AI Research
Synthetic data hold significant promise for advancing medical research and improving health care, particularly in areas such as the rapid analysis of medical images like X-rays. Unlike real-world data collected from patients, synthetic data are generated by algorithms or mathematical models, sometimes trained on real data, and are designed to mimic the statistical patterns of actual clinical information. This approach enables faster hypothesis testing, preliminary model development, and research in low- and middle-income countries where access to real patient data is limited or ethically complex.

One key advantage is that synthetic data can reduce privacy risks, making it easier to share datasets across institutions without compromising patient identities. This has led some universities and research bodies to waive traditional ethics review requirements for studies using synthetic data, on the argument that the usual safeguards are unnecessary when no real individuals are involved.

This shift raises important concerns, however. First, even though synthetic data are not directly tied to individuals, the people whose original records were used to train a generative model could still be re-identified, especially in the early stages of model development. As synthetic data are used to train new models, and those models generate further synthetic data, the link to real-world sources becomes increasingly blurred; yet the potential for re-identification persists, particularly if the synthetic data are highly detailed or specific.

Second, and more fundamentally, there is a growing risk of model collapse: AI systems trained on successive generations of synthetic data begin to produce increasingly inaccurate or nonsensical outputs. This happens when the data lose their real-world grounding and the model starts to reinforce its own errors. Without proper validation, such models may appear reliable while delivering misleading results.
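The feedback loop behind model collapse can be sketched with a deliberately simplified simulation. The example below is an illustrative toy, not a model of any real clinical system: each "generation" fits a Gaussian to samples drawn from the previous generation's fitted Gaussian, so estimation error compounds and the fitted parameters drift away from the original, real-world distribution. All function and parameter names here are invented for illustration.

```python
import random
import statistics

def collapse_demo(generations=10, n_samples=100, seed=0):
    """Toy illustration of model collapse.

    Start from a 'real-world' distribution (standard Gaussian), then
    repeatedly: (1) draw samples from the current fitted model, and
    (2) refit the model to those samples alone. Because each fit sees
    only the previous generation's synthetic output, sampling error
    accumulates and the estimated mean and spread drift away from the
    original values. Returns the (mean, stdev) fitted at each step.
    """
    rng = random.Random(seed)
    mu, sigma = 0.0, 1.0  # ground-truth "real data" distribution
    history = [(mu, sigma)]
    for _ in range(generations):
        # Generate this generation's synthetic dataset from the model...
        samples = [rng.gauss(mu, sigma) for _ in range(n_samples)]
        # ...then refit the model using only that synthetic data.
        mu = statistics.fmean(samples)
        sigma = statistics.stdev(samples)
        history.append((mu, sigma))
    return history

if __name__ == "__main__":
    for gen, (mu, sigma) in enumerate(collapse_demo()):
        print(f"generation {gen}: mean={mu:+.3f}, stdev={sigma:.3f}")
```

With small sample sizes the drift is visible within a handful of generations; real generative models are far more complex, but the structural point is the same: once real data leave the loop, there is nothing to pull the model back toward the truth.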
To address these challenges, experts emphasize the need for transparency and rigorous validation. Researchers should clearly document how synthetic data were generated, including the algorithms, parameters, and assumptions used, and should propose ways for independent teams to verify their findings. Some, like Randi Foraker at the University of Missouri, are calling for standardized reporting guidelines, similar to those already in place for real data and code, so that synthetic-data research can be assessed with the same rigor.

Marcel Binz of the Helmholtz Institute for Human-Centred AI highlights the importance of external validation. His team's model Centaur, trained on more than 10 million human decisions from psychology studies, is publicly available and intended to be improved through independent testing. He warns that the current version is likely the weakest, underscoring the need for ongoing scrutiny.

Synthetic data can accelerate innovation and expand access to research tools, but their use must be guided by caution. The belief that a computer-generated result is automatically valid must be rejected. Robust validation, transparency, and ethical oversight are essential to ensure that AI-driven medical research delivers real, trustworthy benefits.
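To make the documentation recommendation concrete, one could imagine attaching a machine-readable "report card" to every synthetic dataset. The sketch below is a hypothetical structure under assumed field names; it is not an established schema or any guideline endorsed by the researchers quoted above, only an illustration of the kind of information (generator, parameters, provenance, assumptions, validation steps) that transparent reporting would capture.

```python
import json
from dataclasses import asdict, dataclass, field

@dataclass
class SyntheticDataCard:
    """Hypothetical reporting record for a synthetic dataset.

    Field names are illustrative assumptions, not a standard. The goal
    is that an independent team could read this record and know how the
    data were generated and what checks have already been performed.
    """
    generator: str                 # algorithm or model family used
    generator_version: str         # exact software version for reproducibility
    source_data: str               # provenance of any real data used in training
    random_seed: int               # seed so generation can be replayed
    parameters: dict = field(default_factory=dict)   # tunable settings used
    assumptions: list = field(default_factory=list)  # modeling assumptions made
    validation: list = field(default_factory=list)   # independent checks done

    def to_json(self) -> str:
        """Serialize the card for archiving alongside the dataset."""
        return json.dumps(asdict(self), indent=2)

# Example usage with invented values:
card = SyntheticDataCard(
    generator="GaussianCopula",
    generator_version="0.1.0",
    source_data="de-identified registry extract (institution-internal)",
    random_seed=42,
    parameters={"n_rows": 10_000},
    assumptions=["variables are approximately continuous"],
    validation=["marginal distributions compared against source data"],
)
print(card.to_json())
```

A record like this does not by itself prevent re-identification or model collapse, but it makes both risks auditable: reviewers can see whether real data entered the pipeline and whether any external validation has actually been done.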
