AI’s Self-Doubt Enables Smarter Reasoning: How Confidence-Aware Thinking Cuts Waste and Boosts Accuracy in Large Language Models
Large language models are becoming increasingly adept at solving complex reasoning tasks such as math olympiad problems, scientific questions, and multi-step logic puzzles. But this performance comes at a steep computational cost: generating hundreds of reasoning paths to improve accuracy can consume over 100 million extra tokens for a single problem, with no guaranteed gain. Worse, low-quality or random solutions can skew the results, undermining the benefits of majority voting.

To tackle this inefficiency, researchers at Meta AI introduced DeepConf, short for "Deep Think with Confidence." The method leverages the model’s internal confidence signals to identify and discard unreliable reasoning paths early, dramatically reducing wasted computation while improving accuracy.

The core idea builds on self-consistency with majority voting: instead of relying on a single answer, the model generates many reasoning traces and selects the most frequent outcome. On the AIME 2025 benchmark, a single pass (pass@1) of Qwen3-8B achieves 68% accuracy; with 512 traces and majority voting (cons@512), accuracy jumps to 82%. But that jump comes at a high price in both time and resources.

DeepConf improves on this by measuring confidence at the token level, using metrics such as token entropy and token confidence to assess how certain the model is about each prediction. Low entropy (a sharp probability peak) signals high confidence; high entropy (a flat distribution) signals uncertainty. Averaging these scores across a reasoning trace yields a confidence score for each full solution path.

DeepConf then filters out low-confidence traces before voting, much as a teacher might discount the guessers in a classroom before tallying answers. This selective approach suppresses noise and concentrates the vote on high-quality reasoning. The key confidence measures are group confidence (a local view over a sliding window of recent tokens), tail confidence (focused on the final reasoning steps, where the answer is produced), and overall trace confidence; together they let DeepConf detect weak reasoning early and act decisively.

The algorithm runs in two modes. In offline mode, it generates all traces first, ranks them by confidence, keeps only the top-performing ones according to a confidence threshold, and votes among the survivors. In online mode, it starts with a small set of warmup traces, uses them to set a confidence threshold, and then generates new traces on the fly, stopping any trace in real time as soon as its confidence dips below the threshold. Generation continues until consensus is reached or the trace budget is exhausted.

The results are striking. On AIME 2025, DeepConf@512 with GPT-OSS-120B reaches 99.9% accuracy, nearly perfect, compared with 97.0% for standard majority voting and just 91.8% for a single pass. At the same time, it cuts token generation by up to 84.7% relative to brute-force parallel reasoning.

This approach embodies the power of self-doubt. Instead of blindly generating more answers, the model learns when to stop, saving compute, reducing errors, and delivering better results with smarter thinking. The lesson is clear: intelligence is not just about doing more, it is about doing better. With confidence-aware scaling, AI becomes not only smarter but also more efficient, more frugal, and more self-aware. The future of reasoning is not brute force; it is thoughtful precision. The sketches below walk through the mechanics step by step.
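To make the token-level signals concrete, here is a minimal Python sketch of entropy-based confidence scoring. Treating negated entropy as the confidence proxy, along with the function names, is an illustrative assumption; DeepConf’s published token-confidence formula is defined over the log-probabilities of the top-k candidate tokens, but it captures the same intuition the article describes.

```python
import math

def token_entropy(probs):
    """Shannon entropy of one next-token distribution.
    A sharp peak gives low entropy; a flat spread gives high entropy."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def token_confidence(probs):
    """Confidence proxy for a single prediction: negated entropy,
    so higher values mean a more certain model."""
    return -token_entropy(probs)

def trace_confidence(per_step_probs):
    """Average token confidence across a full reasoning trace."""
    scores = [token_confidence(p) for p in per_step_probs]
    return sum(scores) / len(scores)

# A confident step versus an uncertain one:
sharp = [0.97, 0.01, 0.01, 0.01]   # model is nearly certain
flat = [0.25, 0.25, 0.25, 0.25]    # model is guessing
print(token_confidence(sharp))      # close to 0 (high confidence)
print(token_confidence(flat))       # about -1.39 (low confidence)
```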
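Group and tail confidence are then simple aggregations over the per-token scores. In the sketch below, the window and tail sizes are illustrative placeholders, not the paper’s tuned values.

```python
def group_confidences(token_confs, window=256):
    """Sliding-window mean of per-token confidence: a local view that can
    expose a weak stretch of reasoning even when the trace-level mean looks fine."""
    out, running = [], 0.0
    for i, c in enumerate(token_confs):
        running += c
        if i >= window:
            running -= token_confs[i - window]  # drop the token leaving the window
        out.append(running / min(i + 1, window))
    return out

def tail_confidence(token_confs, tail=512):
    """Mean confidence over the final tokens, where the answer is committed."""
    last = token_confs[-tail:]
    return sum(last) / len(last)

# A trace whose confidence collapses near the end:
confs = [-0.2] * 900 + [-1.5] * 100
print(min(group_confidences(confs)))  # the local view flags the weak stretch
print(tail_confidence(confs))         # the tail score drags well below -0.2
```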
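With trace-level scores in hand, offline mode reduces to rank, cut, and vote. In this sketch the keep_ratio parameter, its 10% default, and the unweighted vote are assumptions for illustration; the filtering percentile is a tunable of the method, not a fixed constant.

```python
from collections import Counter

def offline_vote(traces, keep_ratio=0.1):
    """Rank finished traces by confidence, keep the top slice,
    and majority-vote among the survivors.
    `traces`: list of (answer, confidence) pairs."""
    ranked = sorted(traces, key=lambda t: t[1], reverse=True)
    kept = ranked[:max(1, int(len(ranked) * keep_ratio))]
    votes = Counter(answer for answer, _ in kept)
    return votes.most_common(1)[0][0]

# Ten traces; the confident ones agree on "42", the guessers scatter.
traces = [("42", -0.2), ("42", -0.3), ("17", -1.4), ("9", -1.5),
          ("42", -0.25), ("3", -1.6), ("42", -0.4), ("8", -1.3),
          ("42", -0.35), ("11", -1.7)]
print(offline_vote(traces))  # -> "42"
```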
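Finally, a sketch of the online mode: calibrate a stopping threshold from a handful of warmup traces, abort any new trace whose running group confidence falls below it, and halt once the vote reaches consensus or the budget is exhausted. The synthetic simulate_trace generator and every numeric setting here are invented for demonstration; they stand in for real streaming decoding and are not the paper’s exact procedure.

```python
import random
from collections import Counter

def simulate_trace(rng):
    """Synthetic stand-in for streaming decoding: returns a trace's
    per-window group-confidence stream plus its final answer."""
    quality = rng.random()                        # hidden "trace quality"
    answer = "42" if quality > 0.4 else str(rng.randint(0, 9))
    stream = [quality - 0.5 + rng.gauss(0, 0.05) for _ in range(50)]
    return stream, answer

def online_deepconf(rng, n_warmup=16, budget=128,
                    cut_percentile=0.5, consensus=0.95):
    # 1. Warmup: run a few full traces and calibrate the stopping threshold
    #    from the lowest group confidence seen in each one.
    warmup = [simulate_trace(rng) for _ in range(n_warmup)]
    floors = sorted(min(stream) for stream, _ in warmup)
    threshold = floors[int(cut_percentile * (len(floors) - 1))]

    votes = Counter(ans for _, ans in warmup)
    for _ in range(budget - n_warmup):
        # 2. Halt outright once the vote is lopsided enough (consensus).
        if votes.most_common(1)[0][1] / sum(votes.values()) >= consensus:
            break
        stream, answer = simulate_trace(rng)
        # 3. Abort a trace at its first low-confidence window; in real
        #    decoding this stops generation mid-trace and saves tokens.
        if any(c < threshold for c in stream):
            continue
        votes[answer] += 1
    return votes.most_common(1)[0][0]

print(online_deepconf(random.Random(0)))  # prints the surviving majority answer
```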