
AI Surpasses Doctors in Prostate Cancer Diagnosis, Yet Human-AI Collaboration Has Its Limits

When AI systems outperform human experts in medical diagnosis, does human-AI collaboration automatically lead to better outcomes? A new study from the University of Chicago offers a surprising answer.

The research focused on prostate cancer diagnosis using magnetic resonance imaging (MRI), a clinically challenging task in which even experienced radiologists struggle with accuracy. The study’s lead author, Dr. Chacha Chen of the University of Chicago, told DeepTech that prostate MRI diagnosis is a genuinely difficult real-world problem. Unlike many prior medical AI studies, which focus on domains where doctors already achieve over 90% accuracy, this area leaves more room for improvement and gives AI greater potential to make a meaningful impact.

The research team trained an AI model based on the nnU-Net architecture using the publicly available PI-CAI dataset of 1,411 cases. The model achieved AUROC scores of 0.730 and 0.790 on the test set, significantly higher than the average performance of eight experienced radiologists from the U.S. and Europe (aged 29 to 52), all with extensive experience in prostate MRI interpretation.

Two clinical deployment scenarios were simulated. In the first phase, radiologists independently diagnosed 75 cases, then reviewed the AI’s predictions before finalizing their decisions. Thirty days later, in the second phase, they received detailed feedback on their individual performance and diagnosed 100 new cases with the AI’s recommendations presented upfront.

The results confirmed AI’s value as a decision support, but they also revealed a critical bottleneck in human-AI collaboration. While the doctors’ average accuracy improved from 63.2% to 66.2% with AI assistance, it still fell short of the AI model’s own 69.3% accuracy. Why? Chen observed that doctors tended to trust the AI more, but lacked the ability to discern when the AI was right and when it was wrong.
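As background for the AUROC figures reported above: AUROC is the probability that a model ranks a randomly chosen positive case above a randomly chosen negative one. A minimal pure-Python sketch of the metric, using made-up labels and scores rather than data from the study:

```python
def auroc(labels, scores):
    """AUROC via the Mann-Whitney U formulation: the fraction of
    (positive, negative) pairs that the scores rank correctly,
    counting ties as half-correct."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Toy example: two positive and two negative cases.
print(auroc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))  # → 0.75
```

An AUROC of 0.5 corresponds to random ranking and 1.0 to perfect ranking, which is why the model's 0.730 and 0.790 represent a solid but far from saturated signal.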
When their initial diagnosis conflicted with the AI’s prediction, which happened in 22.6 cases on average, doctors changed their judgment in only 4.6 cases, a rate of just 20.4%. In these conflicting cases, the doctors’ accuracy dropped to 44.4%, far below their overall performance. In other words, doctors fell back on their own judgment precisely when they needed the AI’s help the most.

In the second phase, performance feedback and upfront presentation of the AI’s recommendations nudged AI adoption from 75.5% to 78.4%, but did not significantly improve diagnostic accuracy. Simply providing data is not enough; changing decision-making habits remains the harder problem.

The team then explored a different approach: collective intelligence. Instead of relying on individual doctors, they aggregated the AI-assisted diagnoses of all eight radiologists using a majority vote. This human-AI ensemble reached 73.3% accuracy, surpassing both the individual doctors (63.2%) and the AI model alone (69.3%).

“This result is crucial,” Chen emphasized. “It shows that humans and AI can truly complement each other. Only when they complement each other can the team outperform either party alone.”

The findings suggest that the key to effective AI integration in medicine may lie not in making individual doctors better at using AI tools, but in designing collaborative systems in which human expertise and AI capabilities are combined through collective decision-making.

Looking ahead, Chen believes the path forward involves not only making AI models more precise, but also deepening doctors’ understanding of AI’s strengths and limitations. “We need to clearly communicate what AI excels at, and where it may struggle, so doctors can develop more appropriate trust and use it more effectively.”

The study, titled “Can Domain Experts Rely on AI Appropriately? A Case Study on AI-Assisted Prostate Cancer MRI Diagnosis,” was presented at the ACM Conference on Fairness, Accountability, and Transparency. Dr. Chacha Chen is the first author, and Professor Chenhao Tan of the University of Chicago is the corresponding author.
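The majority-vote aggregation behind the human-AI ensemble can be sketched in a few lines. The article says only that the eight radiologists’ AI-assisted calls were combined by majority vote; the tie-breaking rule below (toward a positive finding) is an assumption for illustration, not a detail from the study:

```python
def majority_vote(panel_calls):
    """panel_calls: one list of binary calls per reader (1 = clinically
    significant cancer, 0 = not), all covering the same cases in order.
    Returns the majority call per case; ties break toward the positive
    call (an assumption, not specified in the study)."""
    n_readers = len(panel_calls)
    consensus = []
    for case_votes in zip(*panel_calls):  # transpose: iterate per case
        positives = sum(case_votes)
        consensus.append(1 if 2 * positives >= n_readers else 0)
    return consensus

# Three hypothetical readers, three cases.
print(majority_vote([[1, 0, 1],
                     [1, 1, 0],
                     [0, 0, 1]]))  # → [1, 0, 1]
```

The appeal of this design is that each reader’s errors are partly independent, so the aggregate can be more accurate than any single reader or the model alone, which is exactly the pattern the study reports.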
