Efficient Reasoning with Balanced Thinking
Yulin Li Tengyao Tu Li Ding Junjie Wang Huiling Zhen Yixin Chen Yong Li Zhuotao Tian
Abstract
Large Reasoning Models (LRMs) have demonstrated remarkable reasoning capabilities, but frequently suffer from overthinking, in which redundant computation is spent on simple problems, or underthinking, in which too few reasoning paths are explored despite sufficient capability. These issues cause inefficiency and potential inaccuracy, limiting practical deployment in resource-constrained environments. Existing methods for mitigating overthinking, such as suppressing reflective keywords or adjusting the length of the reasoning process, can inadvertently induce underthinking and thereby harm accuracy. We therefore propose ReBalance, a training-free framework that enables efficient reasoning through balanced thinking. ReBalance uses confidence as a continuous indicator of reasoning dynamics, identifying overthinking through high confidence variance and underthinking through consistent overconfidence. By aggregating hidden states from a small dataset into reasoning-mode prototypes, we compute a steering vector to guide the reasoning trajectories of LRMs. A dynamic control function modulates the strength and direction of this vector based on real-time confidence, pruning redundancy during overthinking and encouraging exploration during underthinking. Comprehensive experiments on four models ranging from 0.5B to 32B parameters and nine benchmarks covering mathematical reasoning, general question answering, and coding tasks show that ReBalance effectively reduces output redundancy while improving accuracy, offering a general, training-free, plug-and-play strategy for the efficient and robust deployment of LRMs.
Der Code ist verfügbar unter https://github.com/yu-lin-li/ReBalance.
One-sentence Summary
Researchers from Harbin Institute of Technology and collaborating institutes propose ReBalance, a training-free framework that uses confidence-based steering vectors to dynamically balance reasoning depth. The approach mitigates both overthinking and underthinking in Large Reasoning Models, improving accuracy and efficiency across math, coding, and general question-answering benchmarks without any fine-tuning.
Key Contributions
- The paper introduces ReBalance, a training-free framework that achieves efficient reasoning by leveraging confidence as a continuous indicator, identifying overthinking through high confidence variance and underthinking through consistent overconfidence.
- A steering vector is computed by aggregating hidden states into reasoning mode prototypes, which a dynamic control function modulates in real-time to prune redundancy or promote exploration based on the model's confidence levels.
- Extensive experiments across four models ranging from 0.5B to 32B and nine benchmarks demonstrate that the method effectively reduces output redundancy while simultaneously improving accuracy in math reasoning, general question answering, and coding tasks.
Introduction
Large Reasoning Models (LRMs) excel at complex tasks but often suffer from inefficiency due to overthinking on simple problems or underthinking on difficult ones, which hinders their deployment in resource-constrained environments. Prior attempts to fix overthinking by suppressing reflection or shortening reasoning chains frequently backfire by inducing underthinking, leading to premature and inaccurate conclusions. The authors leverage confidence as a continuous signal to distinguish between these two states and propose REBALANCE, a training-free framework that dynamically steers the model's hidden states to prune redundancy during overthinking while encouraging exploration during underthinking.
Dataset
- The authors curate a diverse evaluation suite spanning mathematics, science, and coding, drawing from established benchmarks like MATH-500, AIME, GSM8K, GPQA DIAMOND, and LIVECODEBENCH.
- The dataset composition includes three difficulty tiers: simple sets like GSM8K (1,319 problems) and AMC23 (40 problems); moderate sets like MATH-500 (500 problems); and hard sets including AIME24/AIME25 (30 problems each), GPQA DIAMOND (198 problems), OLYMPIADBENCH (675 problems), and LIVECODEBENCH v1 (400 problems).
- Specific filtering and sourcing rules apply to each subset, such as using the official 2024/2025 AIME cycles, selecting expert-authored graduate-level questions for GPQA, and ensuring contamination awareness in LIVECODEBENCH by using version v1 with execution-based unit tests.
- For training and evaluation, the authors utilize standard splits where available, such as the ~7.5k training and ~1k test split for GSM8K, while treating other benchmarks as held-out test sets to assess reasoning capabilities.
- The processing pipeline applies a unified prompt template across all math-related subsets, instructing the model to reason step by step and format the final answer within a boxed notation.
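As a concrete illustration, such a unified template might look like the following minimal sketch; the exact wording used by the authors is not reproduced here, so `MATH_PROMPT` and `build_prompt` are hypothetical names and phrasing that follow common practice for math benchmarks.

```python
# Hypothetical unified prompt template for the math-related subsets.
# The boxed-answer instruction mirrors the convention described in the
# paper's processing pipeline; the precise wording is an assumption.
MATH_PROMPT = (
    "{question}\n"
    "Please reason step by step, and put your final answer within \\boxed{{}}."
)


def build_prompt(question: str) -> str:
    """Fill the shared template with a benchmark question."""
    return MATH_PROMPT.format(question=question)
```

Applying one template across all math subsets keeps the elicited reasoning style uniform, so differences in output length and accuracy can be attributed to the steering method rather than prompt variation.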
Method
The authors propose ReBalance, a training-free framework designed to dynamically balance overthinking and underthinking in Large Reasoning Models (LRMs), improving efficiency without compromising accuracy. The framework operates in two stages: offline data collection and online inference with dynamic steering. Refer to the framework diagram for a comprehensive overview of the system architecture.
To effectively control the reasoning process, the method first explicitly models reasoning states prone to overthinking or underthinking using stepwise confidence and confidence variance. Overthinking is identified as a state characterized by low confidence and high variance, reflecting unstable or oscillating reasoning trajectories. Conversely, underthinking is defined by persistently high confidence and low variance, indicating premature convergence. Refer to the examples illustrating these distinct reasoning behaviors and the target balanced state.
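The stepwise signals described above can be sketched as follows. This is an illustrative reading of the paper's characterization, not its exact formulation: stepwise confidence is simplified to the mean token probability within a step, and the thresholds `conf_hi` and `var_hi` are hypothetical.

```python
import numpy as np


def step_confidence(token_logprobs: list[float]) -> float:
    """One simple proxy for stepwise confidence: the mean token
    probability within a reasoning step (an assumption, not the
    paper's exact formula)."""
    return float(np.mean(np.exp(token_logprobs)))


def classify_state(step_confs: list[float],
                   conf_hi: float = 0.9,
                   var_hi: float = 0.02) -> str:
    """Heuristic labeling following the paper's characterization:
    overthinking = low confidence with high variance (oscillating
    trajectory); underthinking = persistently high confidence with
    low variance (premature convergence). Thresholds are illustrative."""
    mean_c = float(np.mean(step_confs))
    var_c = float(np.var(step_confs))
    if var_c > var_hi and mean_c < conf_hi:
        return "overthinking"
    if mean_c >= conf_hi and var_c <= var_hi:
        return "underthinking"
    return "balanced"
```

For example, a steady sequence of very high step confidences would be flagged as underthinking, while a sequence that oscillates between high and low confidence would be flagged as overthinking.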
The framework extracts steering vectors from the hidden states of the LRM to guide the model away from these undesirable modes. During the offline stage, a one-pass data collection is performed on a small seen dataset to identify prototypes for overthinking and underthinking. The authors analyze the linear decodability of confidence signals across layers to automatically select the optimal deep layer for intervention, as visualized in the layer-wise R² analysis. The steering vector is then constructed as the normalized difference between the overthinking and underthinking prototypes, establishing a direction in the latent space for behavior modulation.
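A minimal sketch of the prototype-and-difference construction, assuming hidden states at the selected layer have already been collected for each mode; function and variable names are hypothetical.

```python
import numpy as np


def build_steering_vector(over_states: np.ndarray,
                          under_states: np.ndarray) -> np.ndarray:
    """Offline stage sketch: average the collected hidden states
    (shape: num_samples x hidden_dim) into one prototype per reasoning
    mode, then take the normalized difference as the steering direction."""
    proto_over = over_states.mean(axis=0)    # overthinking prototype
    proto_under = under_states.mean(axis=0)  # underthinking prototype
    direction = proto_over - proto_under
    return direction / np.linalg.norm(direction)
```

Because the vector is normalized, its magnitude carries no information; the online control function (described next in the text) supplies the signed strength at inference time.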
During online inference, a dynamic control function adaptively modulates the steering strength and direction based on real-time model states. This function takes the current stepwise confidence and confidence variance as inputs to compute a steering weight. The weight is designed to push the model's state away from the nearest reasoning boundary, ensuring the trajectory remains within a balanced region. Refer to the visualization of the control function surface, which demonstrates how the steering strength varies non-linearly based on confidence and variance levels to mitigate both overthinking and underthinking.
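The dynamic control idea can be sketched as below. The paper's exact functional form is not given, so `steering_weight` is an illustrative piecewise stand-in: negative in the overthinking regime (pushing the state toward the underthinking side of the direction built above), positive in the underthinking regime, and zero in the balanced region.

```python
import numpy as np


def steering_weight(conf: float, var: float,
                    conf_hi: float = 0.9, var_hi: float = 0.02,
                    alpha: float = 1.0) -> float:
    """Illustrative control function over (stepwise confidence,
    confidence variance). Thresholds and scaling are assumptions."""
    if var > var_hi and conf < conf_hi:
        # Overthinking signature: steer away, capped at strength alpha.
        return -alpha * min(1.0, var / var_hi - 1.0)
    if conf >= conf_hi and var <= var_hi:
        # Underthinking signature: steer toward more exploration.
        return alpha * min(1.0, (conf - conf_hi) / (1.0 - conf_hi))
    return 0.0  # balanced region: no intervention


def apply_steering(h: np.ndarray, v: np.ndarray, w: float) -> np.ndarray:
    """Add the weighted steering vector to the hidden state h at the
    selected intervention layer."""
    return h + w * v
```

The key property this sketch preserves from the description is that the steering strength varies continuously with the real-time signals rather than being a fixed offset, so the intervention vanishes once the trajectory re-enters the balanced region.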
Experiment
- Analysis of reasoning length distributions reveals that existing overthinking mitigation methods often induce underthinking by prematurely truncating necessary steps, whereas the proposed ReBalance method achieves a balanced reduction that preserves accuracy while shortening outputs.
- Experiments demonstrate that confidence variance and step-level confidence serve as reliable indicators for distinguishing between overthinking (high variance, low confidence) and underthinking (persistently high confidence), enabling fine-grained behavioral control without auxiliary models.
- Evaluations across diverse benchmarks in mathematics, science, code, and commonsense reasoning confirm that ReBalance significantly reduces token usage and inference latency while improving or maintaining Pass@1 accuracy, outperforming prompt-based and external verifier-based baselines.
- Ablation studies validate that dynamic control based on confidence signals is superior to static adjustments, and that steering vectors extracted from medium-difficulty datasets generalize effectively across different domains and model sizes.
- Additional tests on NPU devices and creative writing tasks show that the method maintains robust performance on specialized hardware and preserves or enhances the model's creative expressiveness and linguistic diversity.