Addressing Bias in AI Datasets: Key Questions for Educators and Students
Leo Anthony Celi, a senior research scientist at MIT's Institute for Medical Engineering and Science, a physician at Beth Israel Deaconess Medical Center, and an associate professor at Harvard Medical School, has highlighted a crucial gap in AI education: students are rarely trained to recognize potential bias in their datasets. The result can be AI models that are ineffective or even harmful when applied to diverse populations, particularly those underrepresented in the original clinical trials.

Celi emphasizes that bias creeps into datasets in several ways. Pulse oximeters, for instance, have been found to overestimate oxygen levels in people of color because the clinical trials behind them lacked sufficient representation. Medical devices more broadly are typically optimized on healthy young males rather than older or sicker patients, and the FDA requires only proof that devices work on healthy subjects, which compounds the problem.

Electronic health records (EHRs), the backbone of AI in healthcare, are another source of bias. These records were never designed for machine learning, and they can introduce problems such as sampling selection bias when certain demographics are less likely to be admitted to ICUs in the first place.

To combat these issues, Celi and his team advocate a fundamental shift in how AI courses are structured. They recommend dedicating a significant portion of the curriculum, ideally half, to understanding the data itself. Students should be taught to ask critical questions: Where did the data come from? Who were the observers and collectors? What is the landscape of the institutions involved? Who is represented in the dataset? If the data primarily covers patients who could easily access an ICU, for example, a model trained on it may fail to predict outcomes for those who could not.

One promising technical approach is the development of transformer models for numeric EHR data, which can help mitigate the effects of missing data caused by social determinants of health and provider biases. By analyzing the relationships among lab results, vital signs, and treatments, transformer models can reduce the impact of incomplete or biased records (a minimal sketch of this idea appears below).

To gauge the current state of AI bias education, Celi initiated a review of existing courses. Of the 11 courses examined, only five mentioned bias at all, and just two discussed it in any depth. The gap underscores the need for better educational practice: existing courses teach valuable technical skills but often neglect data sources and their biases, and Celi argues that without this knowledge students are ill-prepared to build effective and fair AI models.

The MIT Critical Data consortium, founded in 2014, organizes datathons around the world in which healthcare professionals and data scientists collaborate to scrutinize local datasets. Bringing together different generations, disciplines, and backgrounds, Celi believes, naturally promotes critical thinking, because participants challenge one another's assumptions and approaches.

He stresses that students should not rush to build models without a deep understanding of the data. Before starting any project, they should investigate where the data originated, how good it is, which devices produced the measurements, and whether those devices are consistently accurate across different populations. Acknowledging the imperfections in the data is the first step toward improving it.
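That last question, whether a device is consistently accurate across populations, lends itself to a concrete check. Below is a minimal sketch, not a method from Celi's group: it assumes a hypothetical CSV of paired pulse-oximeter readings and gold-standard arterial blood-gas values with a self-reported race column (the file and column names are invented), and it uses the "hidden hypoxemia" definition from the pulse-oximetry literature, an SaO2 below 88 percent despite an SpO2 of 92 percent or higher.

```python
import pandas as pd

# Hypothetical table of paired measurements: one row per patient, with a
# pulse-oximeter reading (spo2), a simultaneous arterial blood gas (sao2),
# and self-reported race. File and column names are assumptions.
df = pd.read_csv("paired_oximetry_readings.csv")  # columns: spo2, sao2, race

# Per-group device bias: how far the oximeter reads above the gold standard.
df["error"] = df["spo2"] - df["sao2"]

# "Hidden hypoxemia": the device says the patient is fine (SpO2 >= 92) while
# the blood gas says otherwise (SaO2 < 88). If this rate differs across
# groups, the device is not consistently accurate across populations.
df["hidden_hypoxemia"] = (df["spo2"] >= 92) & (df["sao2"] < 88)

audit = df.groupby("race").agg(
    n=("error", "size"),
    mean_overestimate=("error", "mean"),
    hidden_hypoxemia_rate=("hidden_hypoxemia", "mean"),
)
print(audit)
```

A sharp gap in the hidden-hypoxemia rate between groups is exactly the pattern the pulse-oximetry studies found, and exactly the kind of dataset imperfection students should surface before any modeling begins.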
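Here, as promised, is the sketch of a transformer over numeric EHR data. It is an illustrative design under stated assumptions, not the architecture Celi's team is developing: each recorded measurement becomes a token built from a learned per-feature embedding plus a projection of its numeric value, and a padding mask keeps missing measurements out of self-attention entirely, so the model reasons only from what was actually observed rather than from imputed values. All names and dimensions are invented.

```python
import torch
import torch.nn as nn

class NumericEHRTransformer(nn.Module):
    """Minimal sketch: attend over observed labs/vitals, mask out missing ones."""

    def __init__(self, n_features: int, d_model: int = 64, n_heads: int = 4):
        super().__init__()
        # One learned embedding per measurement type (e.g. creatinine, SpO2).
        self.feature_embed = nn.Embedding(n_features, d_model)
        # Project the numeric reading itself into the same space.
        self.value_proj = nn.Linear(1, d_model)
        layer = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=n_heads, batch_first=True
        )
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(d_model, 1)  # e.g. a mortality-risk logit

    def forward(self, feature_ids, values, observed_mask):
        # feature_ids:   (batch, seq) which measurement type each slot holds
        # values:        (batch, seq) the numeric readings, zeroed where missing
        # observed_mask: (batch, seq) bool, True where a value was recorded
        tokens = self.feature_embed(feature_ids) + self.value_proj(values.unsqueeze(-1))
        # src_key_padding_mask is True at positions to IGNORE, so missing
        # measurements never enter attention at all.
        h = self.encoder(tokens, src_key_padding_mask=~observed_mask)
        # Mean-pool over observed positions only, then predict.
        m = observed_mask.unsqueeze(-1).float()
        pooled = (h * m).sum(dim=1) / m.sum(dim=1).clamp(min=1.0)
        return self.head(pooled)

# Toy usage: 4 patients, 20 measurement slots, 128 possible lab/vital types.
model = NumericEHRTransformer(n_features=128)
ids = torch.randint(0, 128, (4, 20))
vals = torch.randn(4, 20)
obs = torch.rand(4, 20) > 0.3               # roughly 70% of slots observed
risk = model(ids, vals * obs.float(), obs)  # shape (4, 1)
```

A fuller version would also encode timestamps and measurement context, but the structural point stands: missingness driven by social determinants or provider behavior is handled explicitly rather than papered over by imputation.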
The MIMIC database is a case in point: it took a decade of feedback on its initial shortcomings to arrive at a reliable schema. Celi is also excited about the transformative effect of datathons. Attendees often leave with renewed enthusiasm for the field, recognizing both its vast potential and its risks. By fostering awareness and critical engagement, Celi hopes to empower the next generation of AI practitioners to build models that are not only technically sound but also ethically responsible.

Experts across the field agree that addressing dataset bias is imperative for the ethical and effective deployment of AI in healthcare. Courses that incorporate these elements will better prepare students to navigate the messiness of real-world data and to build models that are robust and applicable to a diverse range of patients. MIT's Institute for Medical Engineering and Science and the MIT Critical Data consortium are making significant strides here, pushing for a more comprehensive and nuanced AI education. Celi's combined background as researcher, physician, and educator uniquely positions him to bridge technical expertise and ethical consideration in AI, and his work highlights the urgent need for reform in AI education to ensure that future models are both accurate and equitable.