HyperAI


On-Policy Distillation: Where It Helps, Where It Hurts, and Why

Mohammadreza Armandpour, Fatih Ilhan, David Harrison, Ajay Jaiswal, Duc N.M. Hoang, Fartash Faghri, Yizhe Zhang, Minsik Cho, Mehrdad Farajtabar

Abstract

On-policy distillation provides dense, per-token supervision for training reasoning models. However, it remains unclear under which conditions this signal is beneficial and where it becomes harmful. Which teacher model should be used? In self-distillation, which specific context should serve as the supervisory signal? And does the optimal choice vary from token to token? At present, answering these questions typically requires costly training runs, in which aggregate performance metrics obscure token-level dynamics. We introduce a training-free diagnostic framework that operates at the finest possible resolution: per token, per question, and per teacher. We derive an ideal per-node gradient, defined as the parameter update that maximizes the student's probability of success, and then develop an efficient, scalable targeted-rollout algorithm to estimate this gradient, even over long intermediate chains of thought. The gradient alignment score, defined as the cosine similarity between this ideal gradient and the gradient produced by any given distillation procedure, measures how closely a particular configuration approximates the ideal signal. Across a variety of self-distillation settings and external teacher models, we observe that distillation guidance aligns substantially more strongly with the ideal signal on incorrect rollouts than on correct ones: on correct rollouts the student already performs well, while the teacher's signal tends to become noisy. Moreover, we find that the optimal distillation context depends jointly on the student model's capacity and the target task, and no single configuration is universally effective. These findings motivate per-task, per-token diagnostic analyses in distillation pipelines.

One-sentence Summary

The authors introduce a training-free diagnostic framework for on-policy distillation in reasoning models that derives an ideal per-node gradient and employs a scalable targeted-rollout algorithm to estimate it, using the gradient alignment score to reveal that distillation guidance aligns more strongly on incorrect rollouts and that the optimal context depends on student capacity and task, motivating per-task, per-token diagnostic analyses for distillation.

Key Contributions

  • The paper introduces a training-free diagnostic framework operating at per-token resolution that derives an ideal per-node gradient and develops a scalable targeted-rollout algorithm for efficient estimation. A gradient alignment score is defined to quantify the extent to which a specific distillation configuration approximates this ideal signal.
  • Empirical analysis across various self-distillation settings and external teacher models shows that distillation guidance aligns substantially higher with the ideal on incorrect rollouts compared to correct ones. Findings further demonstrate that the optimal distillation context depends on the student model's capacity and target task, indicating no single universally effective configuration exists.
  • The work provides a mechanistic explanation for distillation phenomena by showing that reward and distillation objectives share the same local structure through gradient decomposition. This unification enables direct offline comparison at token granularity without requiring additional training or models.

Introduction

On-policy distillation has become a standard post-training technique for reasoning models as it provides dense per-token supervision that complements sparse reinforcement learning rewards. Despite its utility, practitioners face unresolved challenges regarding teacher selection and context design because existing evaluation relies on costly training runs where aggregate metrics obscure token-level dynamics. The authors introduce a training-free diagnostic framework that assesses teacher guidance quality at the finest granularity. They derive an ideal per-node gradient based on success probability and develop a scalable targeted-rollout algorithm to estimate it efficiently, enabling the quantification of gradient alignment scores to identify beneficial configurations without performing additional training.

Method

The authors propose a framework to evaluate the quality of teacher guidance by measuring the alignment between the distillation gradient and an ideal gradient derived from task success. This method addresses the challenge of distinguishing reasoning-critical disagreements from stylistic variations in teacher outputs. The overall process involves estimating success probabilities, computing teacher gradients, and measuring their alignment.

Refer to the framework diagram for an overview of the three-step computation.

Estimating Success Probability and Ideal Gradient The process begins by decomposing the generation into a tree structure. Given $G$ trajectories sampled from the student policy $\pi_{\theta}$, each node $u$ represents a token position. By observing which rollouts reach a correct answer after choosing a specific token $k$ at node $u$, the authors estimate the empirical success probability $\hat{P}_{\text{succ}}^{k}$. This allows them to define an ideal gradient $\mathbf{g}_{\text{ideal}}$ that points toward tokens maximizing the probability of a correct outcome.
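The estimation step above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the helper names (`empirical_success`, `ideal_gradient`), the REINFORCE-style form of the ideal gradient with respect to the node's logits, and the fallback to the mean success rate for unobserved tokens are all assumptions.

```python
import numpy as np

def empirical_success(rollouts, prefix):
    """Estimate P_succ^k at the node reached by `prefix`: for each next
    token k observed after the prefix, the fraction of rollouts through
    that branch that end in a correct answer.

    rollouts: list of (token_sequence, is_correct) pairs sampled from
    the student policy.
    """
    counts, wins = {}, {}
    d = len(prefix)
    for tokens, ok in rollouts:
        if len(tokens) > d and tuple(tokens[:d]) == tuple(prefix):
            k = tokens[d]
            counts[k] = counts.get(k, 0) + 1
            wins[k] = wins.get(k, 0) + int(ok)
    return {k: wins[k] / counts[k] for k in counts}

def ideal_gradient(p_student, p_succ):
    """REINFORCE-style gradient of expected success w.r.t. the node's
    logits: g_k = p_k * (P_succ^k - baseline), where the baseline is the
    expected success rate under the student's distribution.
    Tokens never sampled fall back to the mean observed success rate
    (an assumption made for this sketch)."""
    fallback = float(np.mean(list(p_succ.values())))
    vals = np.array([p_succ.get(k, fallback) for k in range(len(p_student))])
    baseline = float(p_student @ vals)
    return p_student * (vals - baseline)
```

Because the baseline is the student-weighted mean, the components of this gradient sum to zero: it only shifts probability mass between tokens, toward those with above-average success rates.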

Teacher Forward Pass and Distillation Gradients Next, the method computes the gradient produced by the distillation algorithm. For Generalized Knowledge Distillation (GKD), the loss minimizes the forward KL divergence between the student and teacher distributions. The resulting gradient for token $j$ at node $u$ takes the form:

\mathbf{g}_{j}^{\text{KD}} = P_{\theta}^{j}\,(\ell_{j} - \bar{\ell})

where $\ell_{k} = \log P_{\theta}^{k} - \log P_{\text{te}}^{k}$ is the per-token log-ratio. Similar forms apply to single-sample estimators and MiniLLM, allowing for a unified comparison.
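A minimal numerical sketch of this gradient, under the assumption that $\bar{\ell}$ is the expectation of $\ell$ under the student, in which case the expression above is exactly the gradient of $\mathrm{KL}(P_{\theta} \| P_{\text{te}})$ with respect to the node's logits. The function names are illustrative, and the finite-difference check is our addition:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def gkd_gradient(z_student, p_teacher):
    """Gradient of KL(P_theta || P_te) w.r.t. the student's logits at one
    node: g_j = p_j * (l_j - l_bar), where l_j is the per-token log-ratio
    and l_bar is its expectation under the student."""
    p = softmax(z_student)
    ell = np.log(p) - np.log(p_teacher)
    return p * (ell - p @ ell)

# Sanity check: the analytic form matches a central finite difference.
rng = np.random.default_rng(0)
z = rng.normal(size=5)
q = softmax(rng.normal(size=5))
kl = lambda z: float(softmax(z) @ (np.log(softmax(z)) - np.log(q)))
eye = np.eye(5)
num = np.array([(kl(z + 1e-5 * eye[j]) - kl(z - 1e-5 * eye[j])) / 2e-5
                for j in range(5)])
assert np.allclose(gkd_gradient(z, q), num, atol=1e-6)
```

As with the ideal gradient, subtracting the student-weighted mean $\bar{\ell}$ makes the components sum to zero, so the two vectors can be compared directly at each node.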

Computing the Alignment Score Finally, the framework computes the alignment score $\text{Align}(u)$ as the cosine similarity between the ideal gradient and the distillation gradient:

\text{Align}(u) = \cos(\mathbf{g}_{u}^{\text{ideal}}, \mathbf{g}_{u}^{\text{D}})

A positive score indicates the teacher pushes the student toward successful tokens, while a negative score implies the guidance is harmful.
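This score is a plain cosine similarity; a short sketch follows, where `alignment_score` is a hypothetical helper name and the small `eps` guard against zero-norm gradients is our addition:

```python
import numpy as np

def alignment_score(g_ideal, g_distill, eps=1e-12):
    """Cosine similarity between the ideal gradient and a distillation
    gradient at one node. Positive: the teacher pushes the student toward
    tokens that raise success probability; negative: the guidance is
    harmful at this node."""
    num = float(np.dot(g_ideal, g_distill))
    den = float(np.linalg.norm(g_ideal) * np.linalg.norm(g_distill)) + eps
    return num / den
```

A score near zero, such as for orthogonal gradients, indicates guidance that is neither helpful nor harmful at that node.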

Scalability and Rollout Generation To compute these estimates efficiently, the authors employ targeted rollouts rather than exhaustive sampling. They partition the generation into exponentially growing depth windows and prioritize tokens with high GKD gradient magnitude or large probability differences. The student rollouts required for this analysis are generated using specific prompting strategies. These include standard demonstrations with correct responses, prompts containing both correct and wrong examples to discourage imitation of errors, and summarized demonstrations to condense reasoning paths.

This setup ensures that the generation tree is enriched with sufficient samples to reliably estimate $\hat{P}_{\text{succ}}^{k}$ even for less frequent tokens, enabling the alignment analysis to scale to long reasoning traces.
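The windowing and prioritization described above might look roughly like this. The base window size, the doubling factor, and taking the `max` of the two priority signals are illustrative assumptions, not values from the paper:

```python
def depth_windows(seq_len, base=8):
    """Partition token positions [0, seq_len) into exponentially growing
    windows: [0, 8), [8, 24), [24, 56), ... so that early positions are
    sampled densely and deep positions coarsely."""
    windows, start, width = [], 0, base
    while start < seq_len:
        end = min(start + width, seq_len)
        windows.append((start, end))
        start, width = end, width * 2
    return windows

def pick_targets(window, grad_mag, prob_gap, k=4):
    """Within one window, select the k positions with the largest GKD
    gradient magnitude or student-teacher probability gap, which are the
    nodes where targeted rollouts are spent."""
    lo, hi = window
    scored = sorted(range(lo, hi),
                    key=lambda i: max(grad_mag[i], prob_gap[i]),
                    reverse=True)
    return scored[:k]
```

With this scheme the number of targeted nodes grows only logarithmically with trace length, which is what allows the analysis to reach long chains of thought.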

Experiment

Experiments assess gradient alignment between Qwen3 student models and diverse teacher configurations across reasoning benchmarks including BoolQ, MMLU, and AIME. The study finds that distillation signals are consistently more effective on incorrect reasoning paths, where teachers provide stronger guidance to steer students away from failure. Optimal teacher selection depends heavily on student capacity and task difficulty, as self-distillation favors smaller models while external teachers benefit larger ones. These results indicate that no universal distillation recipe exists because effective context design must align with the student's ability to comprehend the provided signals.

The table compares the effectiveness of different context configurations for student models, including self-generated demonstrations, summaries from a larger model, and combined correct and wrong examples. Using only correct demonstrations generally yields better outcomes: including wrong demonstrations consistently lowers performance relative to correct-only contexts. Summaries generated by a larger model provide a performance boost, especially for the 1.7B student on MMLU, though this advantage is less pronounced on BoolQ for both student scales.

The analysis reveals that gradient alignment is consistently stronger on incorrect reasoning paths than on correct paths across model scales and datasets: the teacher's distillation signal is most beneficial when steering the student away from failing trajectories, whereas correct paths already align well with the optimal direction. Weighted cosine metrics confirm this incorrect-path advantage with high statistical significance even in settings where the gap in mean cosine similarity is negligible.

The study investigates the relationship between teacher-student distributional differences and gradient alignment across model scales. Divergence metrics, including KL divergence and L2 distance, correlate positively with gradient alignment across all settings, while high distributional similarity, measured by cosine similarity, predicts lower alignment, implying that teacher signals are less useful when the two models already agree. The positive correlation between normalized reasoning depth and alignment is stronger for the smaller student model than for the larger one.

The table compares gradient alignment metrics for various teacher configurations across two student scales. Self-distillation methods yield higher alignment for the smaller 0.6B student, whereas external teachers become more effective for the larger 1.7B student. Alignment is consistently higher on incorrect paths than on correct ones for almost all teacher configurations, and configurations that include incorrect demonstrations generally show lower alignment scores than those using only correct demonstrations.

The authors evaluate the impact of different in-context demonstration strategies on the Qwen3-0.6B model's performance on MMLU and BoolQ. Providing correct solutions as context yields substantial accuracy gains across all difficulty levels, whereas including incorrect examples alongside correct ones consistently degrades performance relative to correct-only variants. Summarized correct demonstrations and examples from larger models yield performance comparable to raw correct demonstrations.

The study evaluates context configurations and gradient alignment dynamics across student-teacher models of varying scales. Experiments demonstrate that providing correct demonstrations or summaries from larger models enhances performance, whereas including incorrect examples consistently degrades accuracy. Furthermore, gradient alignment is significantly stronger on incorrect reasoning paths and correlates with greater distributional divergence, indicating teacher signals are most useful for correcting errors while self-distillation benefits smaller models more than external teachers.

