HyperAIHyperAI

Command Palette

Search for a command to run...

منذ يوم واحد
التعلم القوي
LLM

مزج سلوك منطقة الثقة للاستخلاص على السياسة

Daniil Plyusov Alexey Gorbatovski Alexey Malakhov Nikita Balagansky Boris Shaposhnikov Daria Korotyshova Daniil Gavrilov

الملخص

يُدرَّب الطالب في التقطير القائم على السياسة (OPD) على سوابق مُستخرجة من سياسته الخاصة، مع محاكاة معلم أقوى. يعالج هذا عدم تطابق السوابق في التقطير غير المتزامن، إلا أن عمليات التنفيذ الأولية للطالب قد تظل ضعيفة، مما يعرض الإشراف المعلمي لسوابق ضعيفة أو منخفضة الجودة. نقترح مزج السلوكيات ضمن منطقة الثقة (TRB)، وهي طريقة إحماء أولي تستبدل سياسة التنفيذ الأولية بسياسة سلوكية أقرب ما تكون إلى المعلم داخل منطقة ثقة KL مركزها الطالب، مع الحفاظ على خسارة OPD المعكوسة KL لكل سابقة دون تغيير. تُخفَّض ميزانية KL تدريجياً إلى الصفر، مما يعيد التدريب إلى عمليات التنفيذ الخاصة بالطالب فقط بعد مرحلة الإحماء. وفي إطار إعدادي التقطير للاستدلال الرياضي، حقق TRB أعلى متوسط أداء مقارنة بالطرق الأخرى.

One-sentence Summary

Trust-Region behavior Blending (TRB) is a warmup method for on-policy distillation that replaces poor early student rollouts with a closest-to-teacher behavior policy inside a student-centered KL trust region, preserving the per-prefix reverse-KL loss while achieving the strongest average performance across two mathematical reasoning distillation settings.

Key Contributions

  • Trust-Region behavior Blending (TRB) replaces early student rollouts in on-policy distillation with a behavior policy constrained within a student-centered KL trust region to preserve supervision quality.
  • The method optimizes the early prefix distribution instead of altering post-generation targets or token selection, and its KL budget anneals to zero to restore pure student rollouts after the warmup horizon.
  • Experiments across two math-reasoning distillation settings show that TRB achieves the strongest average performance among vanilla on-policy distillation and alternative teacher-guidance baselines.

Introduction

Knowledge distillation transfers reasoning capabilities from large teacher models to smaller student models, but traditional methods suffer from exposure bias because students train on fixed teacher prefixes rather than their own rollouts. On-policy distillation addresses this by supervising the student on trajectories it actually generates, yet early training remains brittle since weak students produce low-quality prefixes and stronger teacher intervention risks breaking the on-policy guarantee. The authors leverage Trust-Region behavior Blending (TRB) to stabilize early rollout collection by optimizing a teacher-guided behavior policy within a strict KL trust region around the student. This approach delivers targeted teacher supervision during the critical warmup phase without modifying the underlying distillation objective, and the blending mechanism is systematically annealed once the student policy matures.

Method

The authors leverage a trust-region optimization framework to design Trust-Region behavior Blending (TRB), a warmup strategy for on-policy distillation (OPD) that improves early training stability by guiding rollout generation toward the teacher policy while respecting a student-centered trust region. The core idea is to replace the student’s own policy for generating prefixes during the initial phase of training with a behavior policy that balances proximity to the teacher and adherence to the student’s current behavior. This approach preserves the reverse-KL OPD loss, which remains unchanged throughout training, but modifies the sampling distribution used to collect prefixes. The behavior policy is defined as the closest-to-teacher policy within a KL-divergence constraint relative to the current student policy, ensuring that the generated prefixes are both teacher-informed and not excessively divergent from the student’s current behavior.

Refer to the framework diagram, which illustrates the optimization process at a given prefix hhh. The student policy πS\pi_SπS is represented as a point in the policy space, and the teacher policy πT\pi_TπT is another point. The dashed circle around πS\pi_SπS represents the student-centered KL trust region, defined by the constraint DKL(μπS)εD_{\text{KL}}(\mu \parallel \pi_S) \leq \varepsilonDKL(μπS)ε. The behavior policy μ\mu^*μ is the solution to minimizing the KL divergence to the teacher policy DKL(μπT)D_{\text{KL}}(\mu \parallel \pi_T)DKL(μπT) subject to this constraint. This solution lies on the boundary of the trust region and is the closest policy to the teacher within the allowed deviation. The figure highlights that the optimal policy μ\mu^*μ is obtained by balancing the influence of the teacher and the student, with the constraint ensuring that the policy does not stray too far from the student’s current distribution.

The per-prefix behavior policy is defined as the solution to a constrained optimization problem. At a prefix hhh, the goal is to find a policy μ(h)\mu^*(\cdot \mid h)μ(h) that minimizes the KL divergence to the teacher policy πT(h)\pi_T(\cdot \mid h)πT(h) while ensuring that the KL divergence to the student policy πS(h)\pi_S(\cdot \mid h)πS(h) does not exceed a predefined budget ε\varepsilonε. This is formulated as:

μ(h)=argminμDKL(μπT)s.t.DKL(μπS)ε,aμ(a)=1,μ(a)0.\begin{array}{r l} & \mu^{*}(\cdot \mid h) = \underset{\mu}{\arg\min} \, D_{\mathrm{KL}}(\mu \parallel \pi_{T}) \\ & \quad \mathrm{s.t.} \quad D_{\mathrm{KL}}(\mu \parallel \pi_{S}) \leq \varepsilon, \\ & \quad \quad \quad \sum_{a} \mu(a) = 1, \quad \mu(a) \geq 0. \end{array}μ(h)=μargminDKL(μπT)s.t.DKL(μπS)ε,aμ(a)=1,μ(a)0.

This optimization selects the most teacher-like distribution that remains within a local deviation from the current student policy, effectively guiding the student toward the teacher while maintaining stability. The solution to this problem has a closed-form expression that combines the student and teacher policies in a weighted manner. Specifically, the behavior policy is given by:

μβ(ah)=πS(ah)1βπT(ah)βZβ(h),\mu_{\beta}(a \mid h) = \frac{\pi_{S}(a \mid h)^{1 - \beta} \pi_{T}(a \mid h)^{\beta}}{Z_{\beta}(h)},μβ(ah)=Zβ(h)πS(ah)1βπT(ah)β,

where β[0,1]\beta \in [0,1]β[0,1] controls the degree of influence from the teacher, and Zβ(h)Z_{\beta}(h)Zβ(h) is a normalization constant. The optimal value β(h)\beta^*(h)β(h) is the largest feasible β\betaβ such that the KL divergence constraint is satisfied:

β(h)=max{β[0,1]DKL(μβπS)ε}.\beta^{*}(h) = \max \left\{ \beta \in [0, 1] \mid D_{\mathrm{KL}}(\mu_{\beta} \parallel \pi_{S}) \leq \varepsilon \right\}.β(h)=max{β[0,1]DKL(μβπS)ε}.

When ε=0\varepsilon = 0ε=0, the policy remains unchanged from the student policy, and if the teacher policy itself satisfies the constraint, the behavior policy becomes the teacher policy. Otherwise, β(h)\beta^*(h)β(h) is computed via binary search, leveraging the monotonicity of DKL(μβπS)D_{\mathrm{KL}}(\mu_{\beta} \parallel \pi_{S})DKL(μβπS) in β\betaβ, which ensures convergence.

To further control the transition from teacher-guided to student-driven sampling, TRB employs an annealed warmup schedule. The KL budget ε\varepsilonε is gradually reduced over a warmup horizon KKK, starting from an initial value ε0\varepsilon_0ε0 and decreasing linearly to zero. The budget at step kkk is given by:

εk=ε0(1kK),kK.\varepsilon_k = \varepsilon_0 \left(1 - \frac{k}{K}\right), \quad k \leq K.εk=ε0(1Kk),kK.

This annealing ensures that early rollouts are more influenced by the teacher, providing stable supervision, while the policy gradually reverts to pure student rollouts as training progresses, allowing the student to eventually learn from its own behavior. This mechanism introduces two key hyperparameters: the initial KL budget ε0\varepsilon_0ε0 and the warmup horizon KKK, which determine the strength and duration of the teacher guidance during the warmup phase.

Experiment

The evaluation compares TRB against vanilla OPD and persistent off-policy baselines across two Qwen3 teacher-student model pairs to determine whether limited early behavioral guidance improves final reasoning performance. Benchmark comparisons validate that TRB achieves superior average outcomes by leveraging the same local solver more effectively than fixed blending strategies, while early-training analyses confirm that targeted warmup interventions successfully shift initial rollouts toward teacher-aligned prefixes. These findings indicate that brief early guidance effectively resolves initial student-teacher misalignment, whereas maintaining off-policy behavior throughout training yields diminishing returns and unnecessary computational overhead.

The authors conduct experiments to evaluate TRB, a method that uses limited early guidance from the teacher to improve OPD outcomes. Results show that TRB achieves the highest average performance across two model-pair settings, outperforming baselines that use stronger or more persistent off-policy guidance. The method is most effective during the initial training phase, where it shifts the student's early rollouts toward more promising prefixes, with diminishing returns once the student and teacher behavior aligns. TRB achieves the best average performance in both model-pair settings compared to other off-policy methods. TRB's effectiveness is concentrated during early training, with no need for persistent teacher guidance. The method improves early student rollouts by guiding them toward prefixes that are more likely to succeed under both teacher and student continuation.

The authors evaluate Trust-Region Behavior Blending (TRB) against several baselines in two model-pair settings, showing that TRB achieves the highest average performance across all tasks. Results indicate that early guidance from the teacher improves final outcomes, with TRB outperforming methods that apply teacher guidance persistently or use different warmup strategies. The best-performing methods consistently achieve higher scores in the Qwen3-1.7B-Base setup compared to the smaller Qwen3-0.6B-Base setup. TRB achieves the best average performance in both model-pair settings, outperforming all baselines. TRB shows superior results compared to methods with persistent teacher guidance or fixed epsilon blending. The performance gap between TRB and other methods is most pronounced in the larger model-pair setup.

The authors analyze the impact of early behavior guidance on off-policy distillation by comparing teacher and student continuation strategies during the warmup phase. Results show that teacher-guided continuation consistently outperforms student continuation across different prefix truncation lengths, with the performance gap increasing as the prefix length grows. The relative gain of teacher continuation over vanilla OPD is higher at longer truncation lengths, indicating that early guidance becomes more effective when the student's initial rollouts are further from the teacher's distribution. Teacher continuation provides higher relative gains than student continuation across all prefix lengths. The performance advantage of teacher continuation increases with longer prefix truncation lengths. The relative gain of teacher continuation over vanilla OPD is more pronounced at longer prefix lengths.

The authors evaluate Trust-Region Behavior Blending against multiple off-policy baselines across two model-pair configurations to determine how limited early teacher guidance impacts distillation outcomes. The primary experiments validate that strategically applying brief teacher guidance during initial training significantly improves final performance by steering early rollouts toward successful prefixes, while continuous guidance yields diminishing returns. A secondary analysis of warmup strategies confirms that teacher-guided continuation consistently outperforms student-led approaches, with the most substantial improvements occurring when initial rollouts diverge substantially from the target distribution. Collectively, these findings establish that targeted early intervention is a more efficient and effective strategy than persistent supervision for optimizing off-policy distillation.


بناء الذكاء الاصطناعي بالذكاء الاصطناعي

من الفكرة إلى الإطلاق — سرّع تطوير الذكاء الاصطناعي الخاص بك مع المساعدة البرمجية المجانية بالذكاء الاصطناعي، وبيئة جاهزة للاستخدام، وأفضل أسعار لوحدات معالجة الرسومات.

البرمجة التعاونية باستخدام الذكاء الاصطناعي
وحدات GPU جاهزة للعمل
أفضل الأسعار

HyperAI Newsletters

اشترك في آخر تحديثاتنا
سنرسل لك أحدث التحديثات الأسبوعية إلى بريدك الإلكتروني في الساعة التاسعة من صباح كل يوم اثنين
مدعوم بواسطة MailChimp