Teaching AI to say "I'm not sure"

Researchers at MIT's Computer Science and Artificial Intelligence Laboratory (CSAIL) have developed a new technique to address the persistent problem of overconfidence in large language models. Their findings indicate that current reasoning systems often deliver answers with unwarranted certainty, whether the answer is well grounded or a guess. The team introduced a method called Reinforcement Learning with Calibration Rewards, or RLCR, which trains models to express appropriate levels of uncertainty without sacrificing accuracy.

The root of the issue lies in standard reinforcement learning protocols. The training methods behind recent breakthroughs in AI reasoning, such as those used in systems like OpenAI's o1, reward models solely for correct answers and penalize errors. There is no intermediate reward for honest uncertainty. Consequently, models learn to provide confident responses to every query, even when they are essentially guessing. This creates significant risks for high-stakes fields like medicine, law, and finance, where users rely on AI confidence scores to make decisions. A system that claims 95 percent certainty but is right only half the time is more dangerous than a model that simply provides a wrong answer, because it misleads users into skipping further verification.

Mehul Damani and Isha Puri, lead authors of the study, explained that standard training gives models no incentive to admit ignorance. RLCR addresses this by adding a Brier score to the reward function. This metric penalizes the gap between a model's stated confidence and its actual accuracy. During training, the model learns to generate an answer and a confidence score together. Confidently incorrect answers are penalized, as are correct answers given with unnecessary doubt. The researchers formally proved that this reward structure guarantees both accuracy and well-calibrated uncertainty.
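To make the reward idea concrete, here is a minimal sketch. It assumes the reward is a binary correctness term minus the squared gap between stated confidence and correctness (a per-answer Brier penalty). The article says only that a Brier score is added to the reward, so the exact form, the function name `rlcr_reward`, and the example confidences are illustrative assumptions rather than the authors' implementation.

```python
def rlcr_reward(is_correct: bool, confidence: float) -> float:
    """Correctness reward minus a Brier penalty on the stated confidence.

    Assumed form (not confirmed by the article):
        reward = correct - (confidence - correct) ** 2
    where correct is 1.0 for a right answer and 0.0 for a wrong one.
    """
    if not 0.0 <= confidence <= 1.0:
        raise ValueError("confidence must lie in [0, 1]")
    correct = 1.0 if is_correct else 0.0
    brier_penalty = (confidence - correct) ** 2
    return correct - brier_penalty


if __name__ == "__main__":
    # Hypothetical cases covering the four combinations the article describes.
    cases = [
        ("correct, confident", True, 0.95),
        ("correct, doubtful", True, 0.50),
        ("wrong, doubtful", False, 0.50),
        ("wrong, confident", False, 0.95),
    ]
    for label, is_correct, confidence in cases:
        reward = rlcr_reward(is_correct, confidence)
        print(f"{label:20s} reward = {reward:+.4f}")
```

Under this assumed form, a correct answer always scores higher than a wrong one, and for a given outcome the reward is highest when the stated confidence matches it. That covers both penalties mentioned above: confident errors and needless doubt on correct answers.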
In tests on a 7-billion-parameter model across multiple benchmarks, RLCR reduced calibration error by up to 90 percent while maintaining or improving overall accuracy. The improvement held both for tasks the model had been trained on and for entirely new ones. The study also found that ordinary reinforcement learning training actually degrades a model's ability to estimate its own uncertainty, a flaw that RLCR reverses. The method outperformed post-hoc approaches as well, which attempt to assign confidence scores after the model has already generated an answer.

The team also demonstrated practical applications of the technique. When a model generates multiple candidate answers, selecting the one with the highest self-reported confidence, or weighting votes by confidence, improves both accuracy and calibration. Furthermore, the research showed that the model's explicit reasoning about its own uncertainty contains valuable information. Classifiers trained on model outputs performed better when this uncertainty reasoning was included in the input, indicating that the self-reflection is a useful feature rather than mere decoration.

The findings will be presented at the International Conference on Learning Representations later this month. The paper includes contributions from Stewart Slocum, Idan Shenfeld, Leshem Choshen, and senior authors Jacob Andreas and Yoon Kim. This development marks a significant step toward making AI systems more reliable and trustworthy by ensuring they know when they do not know.
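The two confidence-based selection strategies mentioned above, picking the most confident candidate and weighting votes by confidence, can be sketched in a few lines. The function names, the `(answer, confidence)` pair format, and the sample values are hypothetical. The article does not describe how the researchers implemented either strategy.

```python
from collections import defaultdict


def max_confidence_answer(candidates):
    """Return the answer whose self-reported confidence is highest."""
    if not candidates:
        raise ValueError("need at least one candidate")
    answer, _ = max(candidates, key=lambda pair: pair[1])
    return answer


def confidence_weighted_vote(candidates):
    """Sum the confidences per distinct answer and return the heaviest one."""
    if not candidates:
        raise ValueError("need at least one candidate")
    totals = defaultdict(float)
    for answer, confidence in candidates:
        totals[answer] += confidence
    return max(totals, key=totals.get)


if __name__ == "__main__":
    # Hypothetical samples: (final answer, self-reported confidence in [0, 1]).
    samples = [("42", 0.55), ("42", 0.50), ("17", 0.90), ("13", 0.20)]
    print("max confidence:", max_confidence_answer(samples))
    print("weighted vote: ", confidence_weighted_vote(samples))
```

The two strategies can disagree. In the sample data, the single most confident candidate is "17", while the combined weight of the two "42" candidates (1.05) exceeds it in the vote. Either strategy only helps if the confidences are well calibrated to begin with.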
