Demystifying On-Policy Distillation: Where It Helps, Where It Hurts, and Why

Mohammadreza Armandpour Fatih Ilhan David Harrison Ajay Jaiswal Duc N.M Hoang Farash Faghri Yizhe Zhang Minsik Cho Mehrdad Farajtabar

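Abstract

On-policy distillation provides dense per-token supervision for training reasoning models. However, the conditions under which this signal helps and those under which it hurts remain unclear. Which teacher model should be used, and in self-distillation, which specific context should serve as the supervision signal? Does the best choice vary from token to token? Answering these questions currently requires costly training runs, and the aggregate performance metrics they produce obscure token-level dynamics.

We introduce a training-free diagnostic framework that operates at the finest resolution: per token, per question, and per teacher. We derive the ideal per-node gradient, defined as the parameter update that maximizes the student's probability of success. We then develop a scalable targeted-rollout algorithm that estimates this gradient efficiently even over long chains of intermediate thoughts. The gradient alignment score, defined as the cosine similarity between the ideal gradient and any distillation gradient, quantifies how closely a given configuration approximates the ideal signal.

Across a range of self-distillation settings and external teachers, we observe that distillation guidance aligns substantially better with the ideal signal on incorrect rollouts than on correct rollouts, where the student already performs well and the teacher's signal tends to become noise. We also find that the best distillation context depends jointly on the student's capacity and the target task, and that no single configuration is universally effective. These findings motivate per-task, per-token diagnostic analysis for distillation.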

One-sentence Summary

The authors introduce a training-free diagnostic framework for on-policy distillation in reasoning models that derives an ideal per-node gradient and employs a scalable targeted-rollout algorithm to estimate it, using the gradient alignment score to reveal that distillation guidance aligns more strongly on incorrect rollouts and that the optimal context depends on student capacity and task, motivating per-task, per-token diagnostic analyses for distillation.

Key Contributions

  • The paper introduces a training-free diagnostic framework operating at per-token resolution that derives an ideal per-node gradient and develops a scalable targeted-rollout algorithm for efficient estimation. A gradient alignment score is defined to quantify the extent to which a specific distillation configuration approximates this ideal signal.
  • Empirical analysis across various self-distillation settings and external teacher models shows that distillation guidance aligns substantially higher with the ideal on incorrect rollouts compared to correct ones. Findings further demonstrate that the optimal distillation context depends on the student model's capacity and target task, indicating no single universally effective configuration exists.
  • The work provides a mechanistic explanation for distillation phenomena by showing that reward and distillation objectives share the same local structure through gradient decomposition. This unification enables direct offline comparison at token granularity without requiring additional training or models.

Introduction

On-policy distillation has become a standard post-training technique for reasoning models as it provides dense per-token supervision that complements sparse reinforcement learning rewards. Despite its utility, practitioners face unresolved challenges regarding teacher selection and context design because existing evaluation relies on costly training runs where aggregate metrics obscure token-level dynamics. The authors introduce a training-free diagnostic framework that assesses teacher guidance quality at the finest granularity. They derive an ideal per-node gradient based on success probability and develop a scalable targeted-rollout algorithm to estimate it efficiently, enabling the quantification of gradient alignment scores to identify beneficial configurations without performing additional training.

Method

The authors propose a framework to evaluate the quality of teacher guidance by measuring the alignment between the distillation gradient and an ideal gradient derived from task success. This method addresses the challenge of distinguishing reasoning-critical disagreements from stylistic variations in teacher outputs. The overall process involves estimating success probabilities, computing teacher gradients, and measuring their alignment.

Refer to the framework diagram for an overview of the three-step computation.

Estimating Success Probability and Ideal Gradient. The process begins by decomposing the generation into a tree structure. Given $G$ trajectories sampled from the student policy $\pi_{\theta}$, each node $u$ represents a token position. By observing which rollouts reach a correct answer after choosing a specific token $k$ at node $u$, the authors estimate the empirical success probability $\hat{P}_{\text{succ}}^{k}$. This allows them to define an ideal gradient $\mathbf{g}_{\text{ideal}}$ that points toward tokens maximizing the probability of a correct outcome.
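The estimate described above can be illustrated with a small sketch. The summary does not give the exact formula for the ideal gradient, so the form used here is an assumption: the derivative of the expected success $\sum_k P_{\theta}^{k} \hat{P}_{\text{succ}}^{k}$ with respect to the student logits at one node. The function names, the toy rollouts, and the toy probabilities are hypothetical.

```python
from collections import defaultdict


def empirical_success(rollouts):
    """Estimate the success probability of each token chosen at one node.

    rollouts: iterable of (token, is_correct) pairs, one per rollout that
    passes through the node.
    """
    counts = defaultdict(lambda: [0, 0])  # token -> [num_correct, num_total]
    for token, is_correct in rollouts:
        counts[token][0] += int(is_correct)
        counts[token][1] += 1
    return {tok: correct / total for tok, (correct, total) in counts.items()}


def ideal_gradient(probs, p_succ):
    """Assumed form: gradient of sum_k probs[k] * p_succ[k] w.r.t. logits.

    For a softmax policy this is probs[k] * (p_succ[k] - baseline), where
    baseline is the expected success under the student distribution.
    """
    baseline = sum(probs[k] * p_succ[k] for k in probs)
    return {k: probs[k] * (p_succ[k] - baseline) for k in probs}


# Toy data (hypothetical): eight rollouts through one node, three tokens.
rollouts = [("A", True), ("A", True), ("A", False),
            ("B", False), ("B", False), ("B", True),
            ("C", True), ("C", True)]
p_succ = empirical_success(rollouts)
probs = {"A": 0.5, "B": 0.3, "C": 0.2}  # student probabilities at the node
g_ideal = ideal_gradient(probs, p_succ)
print(p_succ)
print(g_ideal)
```

In this toy case the gradient is positive for tokens whose estimated success exceeds the student's average and negative otherwise, and its components sum to zero, as any logit gradient of a softmax objective does. Tokens that never appear in the rollouts have no estimate here and would need separate handling.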

Teacher Forward Pass and Distillation Gradients. Next, the method computes the gradient produced by the distillation algorithm. For Generalized Knowledge Distillation (GKD), the loss minimizes the forward KL divergence between the student and teacher distributions. The resulting gradient for token $j$ at node $u$ takes the form:

$$\mathbf{g}_{j}^{\text{KD}} = P_{\theta}^{j} (\ell_{j} - \bar{\ell})$$

where $\ell_{k} = \log P_{\theta}^{k} - \log P_{\text{te}}^{k}$ is the per-token log-ratio. Similar forms apply to single-sample estimators and MiniLLM, allowing for a unified comparison.
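A short numerical check can make the gradient form concrete. The sketch below assumes that $\bar{\ell}$ is the mean of the log-ratio under the student distribution, $\bar{\ell} = \sum_k P_{\theta}^{k} \ell_k$, which the summary does not state explicitly. Under that assumption the displayed expression equals the derivative of $\mathrm{KL}(P_{\theta} \,\|\, P_{\text{te}})$ with respect to the student logits, and the code compares it with a finite-difference estimate. All function names and numbers are hypothetical.

```python
import math


def softmax(logits):
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [v / total for v in exps]


def kd_gradient(student_logits, teacher_probs):
    """g_j = P_j * (ell_j - ell_bar), with ell_k = log P_k - log P_te_k."""
    p = softmax(student_logits)
    ell = [math.log(pk) - math.log(qk) for pk, qk in zip(p, teacher_probs)]
    # Assumption: ell_bar is the student-probability-weighted mean of ell.
    ell_bar = sum(pk * lk for pk, lk in zip(p, ell))
    return [pk * (lk - ell_bar) for pk, lk in zip(p, ell)]


def kl_student_teacher(student_logits, teacher_probs):
    """KL(P_student || P_teacher) as a function of the student logits."""
    p = softmax(student_logits)
    return sum(pk * (math.log(pk) - math.log(qk))
               for pk, qk in zip(p, teacher_probs))


def finite_difference(student_logits, teacher_probs, eps=1e-6):
    """Central-difference gradient of the KL w.r.t. each student logit."""
    grads = []
    for j in range(len(student_logits)):
        up = list(student_logits)
        down = list(student_logits)
        up[j] += eps
        down[j] -= eps
        diff = (kl_student_teacher(up, teacher_probs)
                - kl_student_teacher(down, teacher_probs))
        grads.append(diff / (2 * eps))
    return grads


student_logits = [1.0, 0.2, -0.5]   # toy values
teacher_probs = [0.2, 0.7, 0.1]     # toy values
g_kd = kd_gradient(student_logits, teacher_probs)
g_fd = finite_difference(student_logits, teacher_probs)
print(g_kd)
print(g_fd)
```

The two printed vectors should agree to within finite-difference error. Note that this is the gradient of the loss, so a gradient-descent update moves the logits in the opposite direction.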

Computing the Alignment Score. Finally, the framework computes the alignment score $\text{Align}(u)$ as the cosine similarity between the ideal gradient and the distillation gradient:

$$\text{Align}(u) = \cos(\mathbf{g}_{u}^{\text{ideal}}, \mathbf{g}_{u}^{\text{D}})$$

A positive score indicates the teacher pushes the student toward successful tokens, while a negative score implies the guidance is harmful.
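As a minimal sketch of the score itself, the function below computes the cosine similarity between two per-node gradient vectors over the vocabulary. The vectors are toy values, the function name is hypothetical, and returning 0.0 for a zero-norm gradient is an assumption made here, not something the summary specifies. The sketch also assumes both vectors follow the same sign convention (both expressed as update directions).

```python
import math


def alignment(g_ideal, g_distill, eps=1e-12):
    """Cosine similarity between the ideal and distillation gradients."""
    dot = sum(a * b for a, b in zip(g_ideal, g_distill))
    norm_ideal = math.sqrt(sum(a * a for a in g_ideal))
    norm_distill = math.sqrt(sum(b * b for b in g_distill))
    if norm_ideal < eps or norm_distill < eps:
        return 0.0  # assumption: treat an undefined cosine as no alignment
    return dot / (norm_ideal * norm_distill)


g_ideal = [0.02, -0.09, 0.07]    # toy ideal update direction
helpful = [0.01, -0.05, 0.04]    # pushes toward the same tokens
harmful = [-0.02, 0.09, -0.07]   # pushes in the opposite direction

print(alignment(g_ideal, helpful))
print(alignment(g_ideal, harmful))
```

Because the cosine ignores magnitude, a weak but well-directed teacher signal scores as high as a strong one; only the direction across tokens matters.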

Scalability and Rollout Generation. To compute these estimates efficiently, the authors employ targeted rollouts rather than exhaustive sampling. They partition the generation into exponentially growing depth windows and prioritize tokens with high GKD gradient magnitude or large probability differences. The student rollouts required for this analysis are generated using specific prompting strategies. These include standard demonstrations with correct responses, prompts containing both correct and wrong examples to discourage imitation of errors, and summarized demonstrations to condense reasoning paths.

This setup ensures that the generation tree is enriched with sufficient samples to reliably estimate $\hat{P}_{\text{succ}}^{k}$ even for less frequent tokens, enabling the alignment analysis to scale to long reasoning traces.
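One way to picture the windowing and prioritization is sketched below. The summary does not specify the window sizes, the growth factor, or how many targets are chosen per window, so `first`, `factor`, and `per_window` are illustrative assumptions, and the priority scores are placeholders standing in for quantities such as GKD gradient magnitude. For simplicity the sketch ranks positions within a window rather than individual candidate tokens at a node.

```python
def depth_windows(length, first=1, factor=2):
    """Split positions [0, length) into windows of geometrically growing size."""
    windows = []
    start, size = 0, first
    while start < length:
        end = min(start + size, length)
        windows.append((start, end))
        start, size = end, size * factor
    return windows


def pick_targets(scores, windows, per_window=1):
    """Pick the highest-priority positions inside each depth window."""
    targets = []
    for start, end in windows:
        ranked = sorted(range(start, end), key=lambda i: scores[i], reverse=True)
        targets.extend(ranked[:per_window])
    return targets


# Placeholder priority per position (hypothetical values).
scores = [0.1, 0.9, 0.3, 0.2, 0.8, 0.05, 0.4, 0.7, 0.6, 0.0]
windows = depth_windows(len(scores))
targets = pick_targets(scores, windows)
print(windows)
print(targets)
```

With geometric windows the number of targeted positions grows roughly logarithmically with trace length, which is one plausible reason such a scheme would scale to long reasoning chains.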

Experiment

Experiments assess gradient alignment between Qwen3 student models and diverse teacher configurations across reasoning benchmarks including BoolQ, MMLU, and AIME. The study finds that distillation signals are consistently more effective on incorrect reasoning paths, where teachers provide stronger guidance to steer students away from failure. Optimal teacher selection depends heavily on student capacity and task difficulty, as self-distillation favors smaller models while external teachers benefit larger ones. These results indicate that no universal distillation recipe exists because effective context design must align with the student's ability to comprehend the provided signals.

The table compares the effectiveness of different context configurations for student models, including self-generated demonstrations, summaries from a larger model, and combined correct and wrong examples. Results show that using only correct demonstrations generally yields better outcomes than including wrong examples. Furthermore, summaries from a larger model tend to improve performance, particularly for the larger student model on the MMLU benchmark.

  • Including wrong demonstrations consistently leads to lower performance compared to correct-only contexts.
  • Summaries generated by a larger model provide a performance boost, especially for the 1.7B student on MMLU.
  • The advantage of larger model summaries is less significant on the BoolQ benchmark for both student scales.

The analysis reveals that gradient alignment is consistently stronger on incorrect reasoning paths compared to correct paths across various model scales and datasets. This indicates that the teacher's distillation signal is most beneficial when guiding the student away from failing trajectories, whereas correct paths already possess sufficient alignment with the optimal direction. Notably, weighted cosine metrics confirm this trend with high statistical significance even in settings where the mean cosine difference is not significant.

  • Incorrect paths exhibit significantly higher gradient alignment than correct paths across all settings.
  • Weighted cosine metrics show strong statistical significance for the incorrect path advantage even when mean cosine gaps are negligible.
  • The teacher's gradient signal aligns more closely with the reward direction on failing trajectories than on successful ones.

The study investigates the relationship between teacher-student distributional differences and gradient alignment across varying model scales. Findings reveal that greater divergence between the teacher and student distributions consistently correlates with higher gradient alignment, while high similarity predicts lower alignment. Furthermore, the positive trend between reasoning depth and alignment is more evident in smaller models than in larger ones.

  • Divergence metrics including KL and L2 distance consistently correlate positively with gradient alignment across all settings.
  • Distributional similarity measured by cosine similarity shows a negative relationship with alignment, implying less useful signals when models agree.
  • The correlation between normalized depth and alignment is stronger for the smaller student model compared to the larger model.
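For reference, the three distributional measures named above can be computed per node as in the sketch below. The direction of the KL divergence (student relative to teacher) is an assumption, since the summary does not say which direction is used, and the distributions are toy values chosen only to contrast a teacher that agrees with the student against one that disagrees.

```python
import math


def kl(p, q):
    """KL(p || q) for discrete distributions; assumes q > 0 wherever p > 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def l2(p, q):
    """Euclidean distance between two probability vectors."""
    return math.sqrt(sum((pi - qi) ** 2 for pi, qi in zip(p, q)))


def cosine(p, q):
    """Cosine similarity between two probability vectors."""
    dot = sum(pi * qi for pi, qi in zip(p, q))
    norm_p = math.sqrt(sum(pi * pi for pi in p))
    norm_q = math.sqrt(sum(qi * qi for qi in q))
    return dot / (norm_p * norm_q)


student = [0.7, 0.2, 0.1]          # toy student distribution at one node
near_teacher = [0.65, 0.25, 0.10]  # toy teacher that mostly agrees
far_teacher = [0.1, 0.2, 0.7]      # toy teacher that disagrees

for name, teacher in [("near", near_teacher), ("far", far_teacher)]:
    print(name, kl(student, teacher), l2(student, teacher), cosine(student, teacher))
```

The disagreeing teacher yields larger KL and L2 values and a smaller cosine similarity, which is the regime the findings above associate with higher gradient alignment.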

The table compares gradient alignment metrics for various teacher configurations across two student model scales. Results indicate that self-distillation methods generally yield higher alignment for the smaller 0.6B student, while external teachers become more effective for the larger 1.7B student. Additionally, alignment is consistently stronger on incorrect reasoning paths than on correct ones across most settings.

  • Self-distillation methods yield higher alignment for the 0.6B student, whereas external teachers perform better for the 1.7B student.
  • Gradient alignment is consistently higher on incorrect paths than on correct paths for almost all teacher configurations.
  • Configurations that include incorrect demonstrations generally show lower alignment scores compared to those using only correct demonstrations.

The authors evaluate the impact of different in-context demonstration strategies on the Qwen3-0.6B model's performance on MMLU and BoolQ benchmarks. Results indicate that providing correct solutions as context leads to substantial accuracy gains, whereas including incorrect examples alongside correct ones significantly degrades performance.

  • Providing correct solutions as context leads to dramatic accuracy improvements across all difficulty levels.
  • Including incorrect demonstrations alongside correct ones consistently reduces performance compared to correct-only variants.
  • Summarized correct demonstrations and examples from larger models yield performance comparable to raw correct demonstrations.

The study evaluates context configurations and gradient alignment dynamics across student-teacher models of varying scales. Experiments demonstrate that providing correct demonstrations or summaries from larger models enhances performance, whereas including incorrect examples consistently degrades accuracy. Furthermore, gradient alignment is significantly stronger on incorrect reasoning paths and correlates with greater distributional divergence, indicating teacher signals are most useful for correcting errors while self-distillation benefits smaller models more than external teachers.

