HyperAIHyperAI

Command Palette

Search for a command to run...

한 달 전
모델 훈련
LLM

대규모 언어 모델(LLMs)의 On-Policy Distillation에 대한 재고: 현상학, 메커니즘 및 레시피

초록

On-policy distillation (OPD)은 대규모 언어 모델(LLM)의 post-training에서 핵심적인 기술로 자리 잡았으나, 그 training dynamics에 대해서는 여전히 명확히 밝혀지지 않았습니다. 본 논문은 OPD의 dynamics와 메커니즘에 대한 체계적인 조사를 수행합니다.우리는 먼저 OPD의 성공 여부를 결정짓는 두 가지 조건을 규명하였습니다: (i) student 모델과 teacher 모델이 호환 가능한 사고 패턴(thinking patterns)을 공유해야 하며, (ii) 사고 패턴이 일치하고 더 높은 점수를 기록하더라도, teacher는 student가 training 과정에서 접했던 것 이상의 진정으로 새로운 능력을 제공해야 한다는 점입니다. 우리는 weak-to-strong reverse distillation을 통해 이러한 발견을 검증하였으며, 동일 계열의 1.5B 및 7B teacher 모델이 student의 관점에서는 분포적으로 구별 불가능(distributionally indistinguishable)하다는 것을 보여주었습니다.token-level 메커니즘을 심층 분석한 결과, 성공적인 OPD는 student가 방문한 상태(states)에서의 high-probability tokens에 대한 점진적인 alignment를 특징으로 하며, 대부분의 probability mass(97%-99%)가 집중되는 작은 공유 token set를 형성함을 확인하였습니다. 나아가 우리는 실패하는 OPD를 복구하기 위한 두 가지 실질적인 전략인 off-policy cold start와 teacher-aligned prompt selection을 제안합니다. 마지막으로, 우리는 dense token-level reward라는 OPD의 겉보기에 매력적인 '공짜 점심(free lunch)'에는 대가가 따른다는 것을 보여주며, OPD가 long-horizon distillation까지 확장(scale)될 수 있는지에 대한 의문을 제기합니다.

One-sentence Summary

Through a systematic investigation of on-policy distillation (OPD) dynamics in large language model post-training, this paper identifies that success depends on compatible thinking patterns and teacher novelty, uncovers a token-level mechanism characterized by progressive alignment on high-probability tokens, and proposes off-policy cold start and teacher-aligned prompt selection to recover failing distillation processes.

Key Contributions

  • The paper identifies two fundamental conditions for successful on-policy distillation: the requirement for compatible thinking patterns between the student and teacher, and the necessity for the teacher to provide genuinely new capabilities that the student has not encountered during training.
  • The study reveals a token-level mechanism where successful distillation is characterized by progressive alignment on high-probability tokens at student-visited states, with a small shared token set concentrating 97% to 99% of the probability mass.
  • Two practical strategies, off-policy cold start and teacher-aligned prompt selection, are introduced to recover distillation performance when the identified success conditions are not met.

Introduction

On-policy distillation (OPD) has become a vital post-training technique for large language models because it provides dense, per-token supervision that mitigates the exposure bias found in off-policy methods. However, OPD is often fragile, and practitioners frequently encounter scenarios where a stronger teacher fails to improve a student model. The authors investigate this instability by identifying that successful distillation requires both compatible thinking patterns between models and the presence of genuinely new knowledge that the student has not yet acquired. To address these failures, the authors propose two practical strategies: an off-policy cold start to bridge the thinking-pattern gap and teacher-aligned prompt selection to sharpen alignment.

Dataset

Dataset overview
Dataset overview

The authors construct several specialized datasets to facilitate cold-start distillation and controlled evaluations:

  • Cold-Start SFT Dataset: To initialize the student model, the authors sample 200,000 math prompts from the math subset of OpenThoughts3-1.2M. These prompts are processed through an offline teacher rollout using Qwen3-4B (Non-thinking) with a temperature of 0.7 and a maximum generation length of 12,288 tokens. The resulting dataset is filtered to remove incomplete or degenerate repetitive responses.
  • DeepMath Deduplicated Subset: For cross-size experiments, the authors create a version of the DeepMath subset that is deduplicated against DAPO-Math-17K to isolate in-domain prompts from those appearing in the teacher's RL post-training data. This process involves two stages:
    • Exact-match deduplication: The instruction suffixes are removed, and any DeepMath question that exactly matches a DAPO-Math-17K question is discarded.
    • Semantic deduplication: Questions are encoded using the all-mpnet-base-v2 model. Using a FAISS index, the authors identify near-duplicates by calculating cosine similarity. Any DeepMath question with a similarity score of 0.6 or higher to its nearest neighbor in DAPO-Math-17K is removed.
  • Model Training and Usage:
    • The filtered 200,000 teacher-generated pairs are used for full-parameter supervised fine-tuning (SFT) on the Qwen3-1.7B-Base model to produce Qwen3-1.7B-SFT.
    • The deduplicated DeepMath subset is used to enable a controlled comparison between prompts that overlap with teacher post-training data and those that are strictly in-domain.

Method

The authors present a framework for On-Policy Distillation (OPD), a method that transfers knowledge from a teacher language model πT\pi_TπT to a student model πθ\pi_\thetaπθ by minimizing the divergence between their next-token distributions on trajectories sampled from the current student policy. The core mechanism operates in an on-policy setting, where for each prompt xxx drawn from the dataset Dx\mathcal{D}_xDx, the student generates a response y^=(y^1,,y^T)\hat{y} = (\hat{y}_1, \ldots, \hat{y}_T)y^=(y^1,,y^T) through autoregressive sampling. At each step ttt, the student's distribution pt(v)=πθ(vx,y^<t)p_t(v) = \pi_\theta(v \mid x, \hat{y}_{<t})pt(v)=πθ(vx,y^<t) and the teacher's distribution qt(v)=πT(vx,y^<t)q_t(v) = \pi_T(v \mid x, \hat{y}_{<t})qt(v)=πT(vx,y^<t) are compared. The primary objective is to minimize the sequence-level reverse Kullback-Leibler (KL) divergence between the student's and teacher's generated trajectories, which decomposes into a sum of per-token KL divergences over the rollout.

Framework Overview
Framework Overview

As shown in the figure above, the overall framework is structured into three interconnected components. The first, "Phenomenology," identifies two empirical patterns that distinguish effective OPD: consistent thinking patterns and the observation that higher scores do not necessarily imply the acquisition of new knowledge. The central component, "Mechanism," explains why OPD works at the token level, highlighting that progressive alignment on high-probability tokens governs the process and that overlap tokens alone can suffice for effective distillation. The final component, "Recipe," proposes two strategies to rescue failing OPD instances by bridging the thinking-pattern gap, specifically off-policy cold start and teacher-aligned prompts.

The framework encompasses three distinct implementations of OPD, varying in the granularity of supervision used to compute the KL divergence. The most lightweight variant, sampled-token OPD, evaluates the divergence only at the specific token sampled by the student at each step, using the loss tsample=logpt(y^t)logqt(y^t)\ell_t^{\text{sample}} = \log p_t(\hat{y}_t) - \log q_t(\hat{y}_t)tsample=logpt(y^t)logqt(y^t). In contrast, full-vocabulary OPD computes the KL divergence over the entire vocabulary at each step, providing denser gradients but at a higher computational cost. Top-k OPD offers a middle ground by restricting the divergence computation to a subset StS_tSt of the vocabulary, typically the top-kkk tokens with the highest probability under the student's distribution. This approach approximates the full-vocabulary KL divergence while significantly reducing the number of teacher queries, focusing supervision on the student's high-probability region. The authors further define metrics to monitor the distillation process, including the overlap ratio, which measures the alignment between the student's and teacher's top-kkk sets, the overlap-token advantage, which assesses distributional agreement within the overlapping tokens, and the entropy gap, which tracks the difference in uncertainty between the two models.

Experiment

These experiments investigate the conditions and mechanisms governing On-Policy Distillation (OPD) effectiveness by comparing various teacher-student model pairings and training configurations. The results demonstrate that successful distillation depends on thinking-pattern consistency and the presence of new knowledge rather than mere benchmark performance or model scale. Mechanistically, effective OPD is driven by progressive alignment on shared high-probability tokens, a process that can be enhanced through off-policy cold starts or the use of teacher-aligned prompts.

The the the table lists the default hyperparameters used for On-Policy Distillation (OPD) training, including settings for batch size, learning rate, and token support. These parameters define the training configuration for the experiments described in the paper. OPD uses a fixed learning rate of 1e-6 and a global batch size of 64. The training employs a Student Top-K strategy with a LogProb top-K of 16. KL regularization is disabled with a coefficient of 0.0 during training.

OPD training hyperparameters
OPD training hyperparameters

The figure compares successful and failed on-policy distillation training runs, showing that successful distillation is characterized by steady alignment on high-probability tokens, while failure occurs when this alignment stagnates. The main gradients and advantages come from overlapping tokens, which drive the optimization process. Successful OPD shows increasing overlap on high-probability tokens, while failed OPD shows stagnant alignment. The primary optimization signal in OPD comes from shared tokens between student and teacher. Failure occurs when the student cannot align with the teacher's high-probability tokens, leading to weak gradients and no improvement.

OPD training dynamics comparison
OPD training dynamics comparison

The authors investigate the conditions under which on-policy distillation (OPD) succeeds, finding that thinking-pattern consistency between student and teacher models is crucial for effective knowledge transfer. Even when a teacher model outperforms the student, OPD fails if their reasoning patterns are mismatched, and successful distillation occurs only when the teacher possesses new capabilities beyond the student's existing knowledge. Thinking-pattern consistency between student and teacher is essential for successful OPD OPD fails when the teacher's reasoning pattern is incompatible with the student's, regardless of benchmark performance New knowledge acquired by the teacher through post-training enables stronger gains in OPD

OPD effectiveness depends on thinking patterns
OPD effectiveness depends on thinking patterns

The experiment compares OPD performance using different prompt templates, showing that the teacher-aligned template leads to higher accuracy and better overlap growth across benchmarks. Results indicate that aligning the prompt format with the teacher's training data enhances distillation effectiveness by improving compatibility between student and teacher thinking patterns. Using a teacher-aligned prompt template improves accuracy and overlap growth in OPD The teacher-aligned template yields consistent gains across multiple benchmarks Prompt alignment enhances student-teacher compatibility, leading to better distillation outcomes

Prompt template alignment improves OPD
Prompt template alignment improves OPD

The experiment compares distillation from two teachers with different thinking patterns into the same student model. Results show that the teacher with a more compatible thinking pattern achieves stronger performance and higher initial overlap, indicating that thinking-pattern consistency governs OPD effectiveness. Despite similar benchmark performance, the more aligned teacher produces better distillation outcomes. Distillation from a teacher with a compatible thinking pattern outperforms distillation from a mismatched teacher. Initial overlap ratio correlates with downstream performance, suggesting early pattern alignment is crucial. Performance gap persists despite converging overlap curves, indicating early mismatch reduces distillation benefit.

OPD performance comparison
OPD performance comparison

The experiments evaluate the dynamics and success factors of On-Policy Distillation (OPD) by analyzing training stability, thinking-pattern consistency, and prompt alignment. Results demonstrate that successful distillation relies on the student's ability to align with the teacher's high-probability tokens, a process driven by shared reasoning patterns. Ultimately, the effectiveness of OPD is determined by the compatibility of thinking patterns between the teacher and student, which can be further optimized through the use of teacher-aligned prompt templates.


AI로 AI 구축

아이디어에서 출시까지 — 무료 AI 코코딩, 즉시 사용 가능한 환경, 최적의 GPU 가격으로 AI 개발을 가속화하세요.

AI 협업 코딩
바로 사용 가능한 GPU
최적의 가격

HyperAI Newsletters

최신 정보 구독하기
한국 시간 매주 월요일 오전 9시 에 이번 주의 최신 업데이트를 메일로 발송합니다
이메일 서비스 제공: MailChimp