Guided by Gut: Efficient Test-Time Scaling with Reinforced Intrinsic Confidence

Test-Time Scaling (TTS) methods for enhancing Large Language Model (LLM) reasoning often incur substantial computational costs, primarily due to extensive reliance on external Process Reward Models (PRMs) or sampling methods like Best-of-N (BoN). This paper introduces Guided by Gut (GG), an efficient self-guided TTS framework that achieves PRM-level performance without costly external verifier models. Our method employs a lightweight tree search guided solely by intrinsic LLM signals: token-level confidence and step novelty. One critical innovation is improving the reliability of internal confidence estimates via a targeted reinforcement learning fine-tuning phase. Empirical evaluations on challenging mathematical reasoning benchmarks demonstrate that GG enables smaller models (e.g., 1.5B parameters) to achieve accuracy matching or surpassing significantly larger models (e.g., 32B-70B parameters), while reducing GPU memory usage by up to 10x. Compared to PRM-based methods, GG achieves comparable accuracy with 8x faster inference and 4-5x lower memory usage. Additionally, GG reduces KV cache memory usage by approximately 50% compared to the BoN strategy, enabling more efficient and practical deployment of TTS techniques.
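To make the core idea concrete, the sketch below illustrates one plausible way a tree search could use token-level confidence as an intrinsic signal: score each candidate reasoning step by the mean probability of its generated tokens and expand the most confident candidate. This is a minimal illustration under stated assumptions, not the paper's actual implementation; the function names and the candidate data structure are hypothetical, and the real method additionally incorporates step novelty and RL-calibrated confidence.

```python
import math

def step_confidence(token_logprobs):
    """Mean per-token probability of a candidate reasoning step.

    Assumes the decoder exposes a log-probability for each generated
    token; averaging exp(logprob) gives a simple [0, 1] confidence score.
    """
    if not token_logprobs:
        return 0.0
    return sum(math.exp(lp) for lp in token_logprobs) / len(token_logprobs)

def select_step(candidates):
    """Greedily pick the candidate step the model is most confident in.

    `candidates` is a hypothetical list of dicts, each holding the step
    text and the per-token log-probabilities the model assigned to it.
    """
    return max(candidates, key=lambda c: step_confidence(c["logprobs"]))

# Two candidate next steps: the model assigned higher token
# probabilities to the first, so it is selected for expansion.
candidates = [
    {"text": "step A", "logprobs": [-0.1, -0.2, -0.05]},
    {"text": "step B", "logprobs": [-1.0, -0.8, -1.2]},
]
print(select_step(candidates)["text"])  # step A
```

In a full self-guided tree search, this selection would repeat at every expansion, avoiding any call to an external verifier model, which is where the memory and latency savings over PRM-based scoring come from.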