Efficient Reasoning with Balanced Thinking
Yulin Li, Tengyao Tu, Li Ding, Junjie Wang, Huiling Zhen, Yixin Chen, Yong Li, Zhuotao Tian
Abstract
Large Reasoning Models (LRMs) have demonstrated remarkable reasoning capabilities; however, they often suffer from overthinking, devoting redundant computation to simple problems, or underthinking, failing to explore enough reasoning paths despite their intrinsic capabilities. These issues cause inefficiency and potential inaccuracy, limiting practical deployment in resource-constrained environments. Existing methods for mitigating overthinking, such as suppressing reflective keywords or adjusting reasoning length, can inadvertently induce underthinking and thereby compromise accuracy. We therefore propose ReBalance, a training-free framework that enables efficient reasoning through balanced thinking. ReBalance leverages confidence as a continuous indicator of reasoning dynamics, identifying overthinking through high confidence variance and underthinking through persistent overconfidence. By aggregating hidden states from a small seen dataset into reasoning-mode prototypes, we compute a steering vector that guides the reasoning trajectories of LRMs. A dynamic control function modulates the vector's strength and direction according to real-time confidence, pruning redundancy during overthinking and promoting exploration during underthinking. Extensive experiments on four models ranging from 0.5B to 32B parameters, across nine benchmarks spanning math reasoning, general question answering, and coding tasks, demonstrate that ReBalance effectively reduces output redundancy while improving accuracy.
It thus offers a general, training-free, plug-and-play strategy for efficient and robust deployment of LRMs. The code is available at https://github.com/yu-lin-li/ReBalance.
One-sentence Summary
Researchers from Harbin Institute of Technology and collaborating institutes propose ReBalance, a training-free framework that uses confidence-based steering vectors to dynamically balance reasoning depth. This approach effectively mitigates overthinking and underthinking in Large Reasoning Models, enhancing accuracy and efficiency across math, coding, and general question-answering benchmarks without requiring fine-tuning.
Key Contributions
- The paper introduces ReBalance, a training-free framework that achieves efficient reasoning by leveraging confidence as a continuous indicator, identifying overthinking through high confidence variance and underthinking through consistent overconfidence.
- A steering vector is computed by aggregating hidden states into reasoning mode prototypes, which a dynamic control function modulates in real-time to prune redundancy or promote exploration based on the model's confidence levels.
- Extensive experiments across four models ranging from 0.5B to 32B and nine benchmarks demonstrate that the method effectively reduces output redundancy while simultaneously improving accuracy in math reasoning, general question answering, and coding tasks.
Introduction
Large Reasoning Models (LRMs) excel at complex tasks but often suffer from inefficiency due to overthinking on simple problems or underthinking on difficult ones, which hinders their deployment in resource-constrained environments. Prior attempts to fix overthinking by suppressing reflection or shortening reasoning chains frequently backfire by inducing underthinking, leading to premature and inaccurate conclusions. The authors leverage confidence as a continuous signal to distinguish between these two states and propose REBALANCE, a training-free framework that dynamically steers the model's hidden states to prune redundancy during overthinking while encouraging exploration during underthinking.
Dataset
- The authors curate a diverse evaluation suite spanning mathematics, science, and coding, drawing from established benchmarks such as MATH-500, AIME, GSM8K, GPQA Diamond, and LiveCodeBench.
- The dataset composition includes three difficulty tiers: simple sets like GSM8K (1,319 problems) and AMC23 (40 problems); moderate sets like MATH-500 (500 problems); and hard sets including AIME24/AIME25 (30 problems each), GPQA Diamond (198 problems), OlympiadBench (675 problems), and LiveCodeBench v1 (400 problems).
- Specific filtering and sourcing rules apply to each subset, such as using the official 2024/2025 AIME cycles, selecting expert-authored graduate-level questions for GPQA, and limiting contamination risk in LiveCodeBench by using version v1 with execution-based unit tests.
- For training and evaluation, the authors use standard splits where available, such as the ~7.5k training and ~1k test split for GSM8K, while treating the other benchmarks as held-out test sets to assess reasoning capabilities.
- The processing pipeline applies a unified prompt template across all math-related subsets, instructing the model to reason step by step and format the final answer within a boxed notation.
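The exact wording of the unified prompt template is not given here, but a minimal sketch of such a pipeline step might look as follows (the template string and the `build_prompt` helper are illustrative assumptions, not the paper's actual prompt):

```python
# Hypothetical unified prompt template for math-related subsets: instruct
# step-by-step reasoning and a \boxed{} final answer (wording is assumed).
PROMPT_TEMPLATE = (
    "Please reason step by step, and put your final answer within \\boxed{{}}.\n\n"
    "Problem: {problem}"
)

def build_prompt(problem: str) -> str:
    """Format a problem with the shared step-by-step / boxed-answer instruction."""
    return PROMPT_TEMPLATE.format(problem=problem)

prompt = build_prompt("What is 2 + 2?")
```

Applying one template uniformly keeps answer extraction consistent across subsets, since every completion is expected to end in the same boxed notation.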
Method
The authors propose ReBalance, a training-free framework designed to dynamically balance overthinking and underthinking in Large Reasoning Models (LRMs) to improve efficiency without compromising accuracy. The framework operates through a two-stage process involving offline data collection and online inference with dynamic steering. Refer to the framework diagram for a comprehensive overview of the system architecture.
To effectively control the reasoning process, the method first explicitly models reasoning states prone to overthinking or underthinking using stepwise confidence and confidence variance. Overthinking is identified as a state characterized by low confidence and high variance, reflecting unstable or oscillating reasoning trajectories. Conversely, underthinking is defined by persistently high confidence and low variance, indicating premature convergence. Refer to the examples illustrating these distinct reasoning behaviors and the target balanced state.
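A minimal sketch of this state modeling follows, assuming step confidence is the mean token probability within a step and the variance is taken over a recent window of steps; the thresholds and the exact confidence definition are illustrative assumptions, not the paper's values:

```python
import math

def step_confidence(token_logprobs):
    """Confidence of one reasoning step: mean probability of its tokens."""
    return sum(math.exp(lp) for lp in token_logprobs) / len(token_logprobs)

def classify_state(step_confs, conf_lo=0.6, conf_hi=0.9, var_hi=0.01):
    """Label the reasoning state from a window of recent stepwise confidences.

    Thresholds (conf_lo, conf_hi, var_hi) are hypothetical illustrations.
    """
    mean = sum(step_confs) / len(step_confs)
    var = sum((c - mean) ** 2 for c in step_confs) / len(step_confs)
    if var > var_hi and mean < conf_lo:
        return "overthinking"   # low, oscillating confidence: unstable trajectory
    if var <= var_hi and mean > conf_hi:
        return "underthinking"  # persistently high confidence: premature convergence
    return "balanced"
```

Under this scheme, a wildly oscillating confidence trace maps to overthinking, while a flat, near-saturated trace maps to underthinking.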
The framework extracts steering vectors from the hidden states of the LRM to guide the model away from these undesirable modes. During the offline stage, a one-pass data collection is performed on a small seen dataset to identify prototypes for overthinking and underthinking. The authors analyze the linear decodability of confidence signals across layers to automatically select the optimal deep layer for intervention, as visualized in the layer-wise R² analysis. The steering vector is then constructed as the normalized difference between the overthinking and underthinking prototypes, establishing a direction in the latent space for behavior modulation.
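The prototype and steering-vector construction can be sketched as below, assuming each prototype is the mean hidden state of the selected layer over steps labeled with that mode (the use of simple means, and the array shapes, are assumptions for illustration):

```python
import numpy as np

def build_steering_vector(hidden_over: np.ndarray,
                          hidden_under: np.ndarray) -> np.ndarray:
    """Construct a unit steering direction from reasoning-mode prototypes.

    hidden_over / hidden_under: (n_examples, d_model) hidden states from the
    selected layer, collected offline on steps labeled over-/underthinking.
    """
    proto_over = hidden_over.mean(axis=0)    # overthinking prototype
    proto_under = hidden_under.mean(axis=0)  # underthinking prototype
    diff = proto_over - proto_under
    return diff / np.linalg.norm(diff)       # normalized difference vector
```

Normalizing the difference separates direction from magnitude, so all scaling can be delegated to the online control function.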
During online inference, a dynamic control function adaptively modulates the steering strength and direction based on real-time model states. This function takes the current stepwise confidence and confidence variance as inputs to compute a steering weight. The weight is designed to push the model's state away from the nearest reasoning boundary, ensuring the trajectory remains within a balanced region. Refer to the visualization of the control function surface, which demonstrates how the steering strength varies non-linearly based on confidence and variance levels to mitigate both overthinking and underthinking.
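One plausible form of such a control function is sketched below; the paper does not specify its functional form here, so the piecewise-linear weight, the sign convention (negative pushes away from the overthinking prototype, positive away from the underthinking one), and all constants are assumptions:

```python
import numpy as np

def steering_weight(conf, var, conf_lo=0.6, conf_hi=0.9,
                    var_hi=0.01, alpha=4.0):
    """Signed steering strength from real-time confidence statistics.

    The piecewise-linear form and constants are hypothetical illustrations.
    """
    if var > var_hi and conf < conf_lo:
        # Overthinking region: steer away from the overthinking prototype
        # (negative weight moves opposite the over-minus-under direction).
        return -alpha * (conf_lo - conf)
    if var <= var_hi and conf > conf_hi:
        # Underthinking region: steer away from premature convergence.
        return alpha * (conf - conf_hi)
    return 0.0  # balanced region: leave the hidden state untouched

def steer(hidden: np.ndarray, v: np.ndarray, conf: float, var: float) -> np.ndarray:
    """Add the dynamically weighted steering vector to the hidden state."""
    return hidden + steering_weight(conf, var) * v
```

The farther the state drifts into either region, the stronger the correction, while states already in the balanced region pass through unchanged.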
Experiment
- Analysis of reasoning length distributions reveals that existing overthinking mitigation methods often induce underthinking by prematurely truncating necessary steps, whereas the proposed ReBalance method achieves a balanced reduction that preserves accuracy while shortening outputs.
- Experiments demonstrate that confidence variance and step-level confidence serve as reliable indicators for distinguishing between overthinking (high variance, low confidence) and underthinking (persistently high confidence), enabling fine-grained behavioral control without auxiliary models.
- Evaluations across diverse benchmarks in mathematics, science, code, and commonsense reasoning confirm that ReBalance significantly reduces token usage and inference latency while improving or maintaining Pass@1 accuracy, outperforming prompt-based and external verifier-based baselines.
- Ablation studies validate that dynamic control based on confidence signals is superior to static adjustments, and that steering vectors extracted from medium-difficulty datasets generalize effectively across different domains and model sizes.
- Additional tests on NPU devices and creative writing tasks show that the method maintains robust performance on specialized hardware and preserves or enhances the model's creative expressiveness and linguistic diversity.