Efficient Reasoning with Balanced Thinking
Yulin Li Tengyao Tu Li Ding Junjie Wang Huiling Zhen Yixin Chen Yong Li Zhuotao Tian
Abstract
Large Reasoning Models (LRMs) have demonstrated remarkable reasoning capabilities, yet they often suffer from "overthinking," consuming redundant computational steps on simple problems, or from "underthinking," failing to explore sufficient reasoning paths despite possessing the latent capability. These issues cause inefficiency and potential errors, limiting practical deployment in resource-constrained environments. Existing approaches to mitigating overthinking, such as suppressing reflective keywords or adjusting reasoning length, may inadvertently induce underthinking and undermine accuracy. We therefore propose ReBalance, a training-free framework that achieves efficient reasoning with balanced thinking. ReBalance relies on confidence as a continuous indicator of reasoning dynamics, identifying overthinking through high confidence variance and underthinking through persistent overconfidence. By aggregating hidden states from a small dataset into reasoning mode prototypes, we compute a steering vector to guide the reasoning trajectories of LRMs. A dynamic control function then adjusts the strength and direction of this vector based on real-time confidence, pruning redundancy during overthinking and promoting exploration during underthinking. Extensive experiments on four models ranging from 0.5B to 32B parameters and nine benchmarks spanning mathematical reasoning, general question answering, and coding tasks show that ReBalance effectively reduces output redundancy while improving accuracy, offering a general, training-free, plug-and-play strategy for deploying LRMs efficiently and reliably. Code is available at: https://github.com/yu-lin-li/ReBalance .
One-sentence Summary
Researchers from Harbin Institute of Technology and collaborating institutes propose ReBalance, a training-free framework that uses confidence-based steering vectors to dynamically balance reasoning depth. This approach effectively mitigates overthinking and underthinking in Large Reasoning Models, enhancing accuracy and efficiency across math, coding, and general question-answering benchmarks without requiring fine-tuning.
Key Contributions
- The paper introduces ReBalance, a training-free framework that achieves efficient reasoning by leveraging confidence as a continuous indicator to identify overthinking through high variance and underthinking via consistent overconfidence.
- A steering vector is computed by aggregating hidden states into reasoning mode prototypes, which a dynamic control function modulates in real-time to prune redundancy or promote exploration based on the model's confidence levels.
- Extensive experiments across four models ranging from 0.5B to 32B and nine benchmarks demonstrate that the method effectively reduces output redundancy while simultaneously improving accuracy in math reasoning, general question answering, and coding tasks.
Introduction
Large Reasoning Models (LRMs) excel at complex tasks but often suffer from inefficiency due to overthinking on simple problems or underthinking on difficult ones, which hinders their deployment in resource-constrained environments. Prior attempts to fix overthinking by suppressing reflection or shortening reasoning chains frequently backfire by inducing underthinking, leading to premature and inaccurate conclusions. The authors leverage confidence as a continuous signal to distinguish between these two states and propose REBALANCE, a training-free framework that dynamically steers the model's hidden states to prune redundancy during overthinking while encouraging exploration during underthinking.
Dataset
- The authors curate a diverse evaluation suite spanning mathematics, science, and coding, drawing from established benchmarks like MATH-500, AIME, GSM8K, GPQA DIAMOND, and LIVECODEBENCH.
- The dataset composition includes three difficulty tiers: simple sets like GSM8K (1,319 problems) and AMC23 (40 problems); moderate sets like MATH-500 (500 problems); and hard sets including AIME24/AIME25 (30 problems each), GPQA DIAMOND (198 problems), OLYMPIADBENCH (675 problems), and LIVECODEBENCH v1 (400 problems).
- Specific filtering and sourcing rules apply to each subset, such as using the official 2024/2025 AIME cycles, selecting expert-authored graduate-level questions for GPQA, and ensuring contamination awareness in LIVECODEBENCH by using version v1 with execution-based unit tests.
- For training and evaluation, the authors utilize standard splits where available, such as the ~7.5k training and ~1k test split for GSM8K, while treating other benchmarks as held-out test sets to assess reasoning capabilities.
- The processing pipeline applies a unified prompt template across all math-related subsets, instructing the model to reason step by step and format the final answer within a boxed notation.
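The unified prompt template described above can be sketched roughly as follows; the exact wording of the authors' template is not specified here, so this phrasing is an assumption.

```python
# Illustrative sketch of a unified math prompt template that instructs
# step-by-step reasoning and a \boxed{} final answer. The exact wording
# used by the authors is an assumption.
PROMPT_TEMPLATE = (
    "Please reason step by step, and put your final answer within "
    "\\boxed{{}}.\n\nProblem: {problem}"
)

def build_prompt(problem: str) -> str:
    """Fill the shared template with a single problem statement."""
    return PROMPT_TEMPLATE.format(problem=problem)
```

Applying one template across all math subsets keeps answer extraction uniform: the evaluator only needs to parse the boxed expression.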
Method
The authors propose ReBalance, a training-free framework designed to dynamically balance overthinking and underthinking in Large Reasoning Models (LRMs) to improve efficiency without compromising accuracy. The framework operates through a two-stage process involving offline data collection and online inference with dynamic steering. Refer to the framework diagram for a comprehensive overview of the system architecture.
To effectively control the reasoning process, the method first explicitly models reasoning states prone to overthinking or underthinking using stepwise confidence and confidence variance. Overthinking is identified as a state characterized by low confidence and high variance, reflecting unstable or oscillating reasoning trajectories. Conversely, underthinking is defined by persistently high confidence and low variance, indicating premature convergence. Refer to the examples illustrating these distinct reasoning behaviors and the target balanced state.
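To make these state definitions concrete, the following minimal sketch computes stepwise confidence from token log-probabilities and labels the two modes from the mean and variance of confidence across steps. The confidence formula (mean token probability) and all threshold values are illustrative assumptions, not the authors' settings.

```python
import math

def step_confidence(token_logprobs):
    """One hedged definition of 'stepwise confidence': the mean token
    probability within a reasoning step (assumption, may differ from
    the paper's exact formula)."""
    probs = [math.exp(lp) for lp in token_logprobs]
    return sum(probs) / len(probs)

def classify_state(step_confs, low_conf=0.7, high_conf=0.9, var_thresh=0.01):
    """Toy classifier over a trace of per-step confidences:
    overthinking  = low mean confidence + high variance (oscillation),
    underthinking = persistently high confidence + low variance
    (premature convergence). Thresholds are illustrative placeholders."""
    n = len(step_confs)
    mean = sum(step_confs) / n
    var = sum((c - mean) ** 2 for c in step_confs) / n
    if mean < low_conf and var > var_thresh:
        return "overthinking"
    if mean > high_conf and var < var_thresh:
        return "underthinking"
    return "balanced"
```

For example, a trace like `[0.3, 0.9, 0.2, 0.8]` (unstable, oscillating) would be labeled overthinking, while `[0.95, 0.96, 0.94]` (uniformly overconfident) would be labeled underthinking.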
The framework extracts steering vectors from the hidden states of the LRM to guide the model away from these undesirable modes. During the offline stage, a one-pass data collection is performed on a small seen dataset to identify prototypes for overthinking and underthinking. The authors analyze the linear decodability of confidence signals across layers to automatically select the optimal deep layer for intervention, as visualized in the layer-wise R² analysis. The steering vector is then constructed as the normalized difference between the overthinking and underthinking prototypes, establishing a direction in the latent space for behavior modulation.
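The prototype-difference construction can be sketched as below, assuming hidden states at the selected layer have already been collected and grouped by mode; array names and shapes are illustrative.

```python
import numpy as np

def build_steering_vector(over_hidden: np.ndarray,
                          under_hidden: np.ndarray) -> np.ndarray:
    """Sketch of the steering-vector construction: average the hidden
    states of each group into a mode prototype, then take the normalized
    difference. Inputs have shape (num_samples, hidden_dim); the output
    is a unit vector of shape (hidden_dim,)."""
    proto_over = over_hidden.mean(axis=0)    # overthinking prototype
    proto_under = under_hidden.mean(axis=0)  # underthinking prototype
    diff = proto_over - proto_under
    return diff / np.linalg.norm(diff)
```

Normalizing the difference separates the direction of intervention (fixed offline) from its magnitude, which the online control function sets at each step.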
During online inference, a dynamic control function adaptively modulates the steering strength and direction based on real-time model states. This function takes the current stepwise confidence and confidence variance as inputs to compute a steering weight. The weight is designed to push the model's state away from the nearest reasoning boundary, ensuring the trajectory remains within a balanced region. Refer to the visualization of the control function surface, which demonstrates how the steering strength varies non-linearly based on confidence and variance levels to mitigate both overthinking and underthinking.
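One way to picture such a control function is sketched below. The functional form, thresholds, and gain are illustrative assumptions rather than the paper's actual parameterization: a positive weight adds the steering vector (pushing away from underthinking, toward exploration), while a negative weight subtracts it (pruning redundancy during overthinking), and states already in the balanced region are left untouched.

```python
import numpy as np

def steering_weight(conf, var, low_conf=0.7, high_conf=0.9,
                    var_thresh=0.01, alpha=1.0):
    """Illustrative dynamic control function mapping (confidence,
    variance) to a signed steering strength. All constants are
    placeholder assumptions."""
    if conf > high_conf and var < var_thresh:
        # Premature convergence (underthinking): steer toward exploration.
        return alpha * (conf - high_conf)
    if conf < low_conf and var > var_thresh:
        # Unstable, oscillating trajectory (overthinking): steer to prune.
        return -alpha * (low_conf - conf)
    return 0.0  # balanced region: no intervention

def apply_steering(hidden, v, conf, var):
    """Shift the current hidden state along the unit steering vector v
    by the signed weight."""
    return np.asarray(hidden) + steering_weight(conf, var) * np.asarray(v)
```

Because the weight scales with the distance from the nearest reasoning boundary, intervention is mild near the balanced region and stronger for clearly pathological states.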
Experiment
- Analysis of reasoning length distributions reveals that existing overthinking mitigation methods often induce underthinking by prematurely truncating necessary steps, whereas the proposed ReBalance method achieves a balanced reduction that preserves accuracy while shortening outputs.
- Experiments demonstrate that confidence variance and step-level confidence serve as reliable indicators for distinguishing between overthinking (high variance, low confidence) and underthinking (persistently high confidence), enabling fine-grained behavioral control without auxiliary models.
- Evaluations across diverse benchmarks in mathematics, science, code, and commonsense reasoning confirm that ReBalance significantly reduces token usage and inference latency while improving or maintaining Pass@1 accuracy, outperforming prompt-based and external verifier-based baselines.
- Ablation studies validate that dynamic control based on confidence signals is superior to static adjustments, and that steering vectors extracted from medium-difficulty datasets generalize effectively across different domains and model sizes.
- Additional tests on NPU devices and creative writing tasks show that the method maintains robust performance on specialized hardware and preserves or enhances the model's creative expressiveness and linguistic diversity.