
Learning Beyond the Teacher: Generalized On-Policy Distillation with Reward Extrapolation

Wenkai Yang Weijie Liu Ruobing Xie Kai Yang Saiyong Yang Yankai Lin

Abstract

On-Policy Distillation (OPD), which aligns the student's logit distribution with the teacher's along trajectories generated by the student itself, has demonstrated significant empirical gains in improving student performance, often outperforming off-policy distillation approaches as well as reinforcement learning (RL) paradigms. In this work, we first show theoretically that OPD is a special case of dense Kullback-Leibler (KL)-constrained RL, in which the reward function and the KL regularization are always equally weighted, and whose reference model can be any model. We then propose a generalized framework, G-OPD (Generalized On-Policy Distillation), which extends the standard OPD objective by introducing a flexible reference model and a reward scaling factor that controls the relative weight of the reward against the KL regularization. Through extensive experiments on mathematical reasoning and code generation tasks, we draw two new observations: (1) Setting the reward scaling factor to a value greater than 1 (i.e., reward extrapolation), which we term ExOPD, consistently improves over standard OPD across a wide range of teacher-student size pairings. In particular, when merging knowledge from domain-specific experts (obtained by applying domain-specific RL to the same student model) back into the initial student model, ExOPD enables the student to exceed the teacher's performance bound and to outperform the domain-specialized teacher in each domain.
(2) Building on ExOPD, we further find that in the strong-to-weak distillation setting (i.e., distilling a small student from a large teacher), correcting the reward by choosing the teacher's pre-RL base model as the reference model yields a more accurate reward signal, further improving distillation performance. This approach, however, assumes access to a pre-RL version of the teacher and incurs additional computational cost. We hope our findings will open new directions for future research on OPD.

One-sentence Summary

Researchers from Renmin University and Tencent propose G-OPD, a generalized on-policy distillation framework with reward scaling and flexible reference models, enabling students to surpass teachers via reward extrapolation (ExOPD), especially in multi-expert merging and strong-to-weak settings.

Key Contributions

  • We theoretically unify on-policy distillation (OPD) with dense KL-constrained reinforcement learning, showing that OPD is a special case where reward and KL terms are equally weighted and the reference model is arbitrary, enabling a generalized framework (G-OPD) with flexible reward scaling and reference model selection.
  • We introduce ExOPD, a variant of G-OPD with reward scaling >1, which consistently outperforms standard OPD across teacher-student pairings and enables students to surpass domain-specific teachers when merging multiple RL-finetuned experts, validated on 4 math and 3 code generation benchmarks.
  • In strong-to-weak distillation, we show that using the teacher’s pre-RL model as the reference in ExOPD improves reward accuracy and distillation performance, though it requires additional compute and access to the teacher’s base model, further boosting results over standard OPD.

Introduction

The authors leverage on-policy distillation (OPD) — where a student model learns from teacher logits on its own generated trajectories — to improve LLM post-training, especially in merging domain-specific capabilities or distilling large teachers into smaller students. Prior OPD methods treat reward and KL regularization as fixed-equal components, limiting their flexibility and potential to exceed teacher performance. The authors’ main contribution is Generalized OPD (G-OPD), which introduces a reward scaling factor and flexible reference model; they show that scaling rewards above 1 (ExOPD) lets students surpass teachers, particularly in multi-teacher fusion and strong-to-weak distillation, and further improve results by using the teacher’s pre-RL model as reference — though this adds computational cost.

Method

The authors leverage a generalized on-policy distillation (G-OPD) framework that extends traditional knowledge distillation by incorporating dense token-level rewards and flexible regularization control. Unlike off-policy distillation, which trains the student to mimic teacher-generated trajectories without feedback from its own actions, G-OPD operates on-policy: the student generates its own responses, and the training signal is derived from the divergence between its output distribution and that of the teacher, conditioned on the student’s own rollout.

The core of G-OPD lies in its reformulation of the on-policy distillation objective using a reference model $\pi_{\text{ref}}$ and a reward scaling factor $\lambda$. Starting from the standard OPD objective, which minimizes the reverse KL divergence between the student $\pi_{\theta}$ and the teacher $\pi^{*}$ over student-generated trajectories, the authors re-express this as a KL-constrained reinforcement learning objective. Specifically, they show that OPD is equivalent to maximizing a reward function $r(x, y) = \log \frac{\pi^{*}(y|x)}{\pi_{\text{ref}}(y|x)}$ while penalizing deviation from $\pi_{\text{ref}}$ via a KL term. This equivalence allows them to introduce $\lambda$ to modulate the relative weight of the reward versus the regularization, yielding the generalized objective:

$$\mathcal{J}_{\mathrm{G\text{-}OPD}}(\theta) = \max_{\theta} \; \mathbb{E}_{x \sim D,\, y \sim \pi_{\theta}(\cdot|x)} \left[ \lambda \log \frac{\pi^{*}(y|x)}{\pi_{\mathrm{ref}}(y|x)} - \mathcal{D}_{\mathrm{KL}}\big( \pi_{\theta}(y|x) \,\big\|\, \pi_{\mathrm{ref}}(y|x) \big) \right].$$

This formulation enables two key operational regimes. When $0 < \lambda < 1$, the student's log-probability distribution is encouraged to interpolate between the teacher and reference models, a setting the authors term "reward interpolation." When $\lambda > 1$, the student is pushed beyond the teacher's distribution by extrapolating the reward signal, which they call "reward extrapolation." This flexibility allows practitioners to tune the student's behavior along a spectrum from conservative imitation to aggressive optimization.
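The two regimes can be illustrated with a minimal sketch of the scaled implicit reward $\lambda \log \frac{\pi^{*}(y|x)}{\pi_{\text{ref}}(y|x)}$. The log-probabilities below are hypothetical toy values, not taken from the paper:

```python
def scaled_reward(logp_teacher: float, logp_ref: float, lam: float) -> float:
    """G-OPD's scaled implicit sequence-level reward:
    lam * log( pi*(y|x) / pi_ref(y|x) ),
    computed from log-probabilities of the same sequence y under each model."""
    return lam * (logp_teacher - logp_ref)

# Hypothetical sequence log-probabilities: the teacher assigns this
# rollout higher likelihood than the reference model does.
logp_teacher, logp_ref = -10.0, -14.0

r_interp = scaled_reward(logp_teacher, logp_ref, lam=0.5)   # reward interpolation
r_opd    = scaled_reward(logp_teacher, logp_ref, lam=1.0)   # standard OPD
r_extra  = scaled_reward(logp_teacher, logp_ref, lam=1.25)  # reward extrapolation (ExOPD)

print(r_interp, r_opd, r_extra)  # 2.0 4.0 5.0
```

Scaling only rescales the teacher-vs-reference log-likelihood gap; the regime names describe where this pushes the student relative to the two models once the KL penalty is added.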

In the strong-to-weak distillation setting, where a large teacher is distilled into a smaller student, the authors propose a "reward correction" mechanism. Instead of using the student's base model as the reference, they advocate for using the teacher's pre-RL base model $\pi_{\text{base}}^{\text{teacher}}$, which yields a cleaner implicit reward signal aligned with the teacher's RL training trajectory. This correction adjusts the reward from $\log \frac{\pi^{*}}{\pi_{\text{base}}^{\text{student}}}$ to $\log \frac{\pi^{*}}{\pi_{\text{base}}^{\text{teacher}}}$, effectively compensating for architectural and capacity mismatches between the teacher and student base models.
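The correction amounts to swapping the denominator of the implicit reward. A small sketch with hypothetical log-probabilities (not from the paper) makes the difference concrete:

```python
def implicit_reward(logp_teacher: float, logp_ref: float) -> float:
    """Implicit reward log( pi*(y|x) / pi_ref(y|x) ) for one sequence."""
    return logp_teacher - logp_ref

# Hypothetical log-probabilities of the same rollout under each model:
lp_teacher_rl   = -9.0   # RL-tuned teacher pi*
lp_student_base = -15.0  # small student's base model (default reference)
lp_teacher_base = -11.0  # teacher's pre-RL base model (corrected reference)

r_uncorrected = implicit_reward(lp_teacher_rl, lp_student_base)  # 6.0
r_corrected   = implicit_reward(lp_teacher_rl, lp_teacher_base)  # 2.0

# The corrected reward isolates what RL added on top of the teacher's
# own base model, instead of conflating that signal with the
# teacher/student capacity gap baked into the student's base model.
```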

The training dynamics are governed by a policy gradient estimator derived from the G-OPD objective. The approximated gradient, computed under a zero discount factor for computational efficiency, takes the form:

$$\nabla_{\theta} \mathcal{J}_{\mathrm{G\text{-}OPD}}(\theta) = \mathbb{E}_{x \sim D,\, y \sim \pi_{\theta}(\cdot|x)} \left[ \sum_{t=1}^{T} A_{t}^{\mathrm{G\text{-}OPD}} \, \nabla_{\theta} \log \pi_{\theta}(y_{t}|x, y_{<t}) \right],$$

where the token-level advantage is defined as:

$$A_{t}^{\mathrm{G\text{-}OPD}} = \big( \log \pi_{\theta}(y_{t}|x, y_{<t}) - \log \pi^{*}(y_{t}|x, y_{<t}) \big) + (\lambda - 1) \big( \log \pi_{\mathrm{ref}}(y_{t}|x, y_{<t}) - \log \pi^{*}(y_{t}|x, y_{<t}) \big).$$

This advantage function encapsulates both the student-teacher mismatch and the reference-induced reward shift, enabling dense, per-token credit assignment that accelerates convergence and improves generalization.
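The advantage above can be sketched directly from per-token log-probabilities. This is a minimal NumPy illustration with hypothetical toy values, not the authors' implementation:

```python
import numpy as np

def gopd_advantage(logp_student, logp_teacher, logp_ref, lam):
    """Per-token G-OPD advantage:
    A_t = (log pi_theta - log pi*) + (lam - 1) * (log pi_ref - log pi*).
    With lam = 1 the second term vanishes and A_t reduces to the standard
    OPD advantage, i.e. the student-teacher log-probability gap."""
    logp_student = np.asarray(logp_student, dtype=float)
    logp_teacher = np.asarray(logp_teacher, dtype=float)
    logp_ref = np.asarray(logp_ref, dtype=float)
    return (logp_student - logp_teacher) + (lam - 1.0) * (logp_ref - logp_teacher)

# Hypothetical per-token log-probs along one student rollout:
lp_s = [-1.0, -2.0, -0.5]   # student pi_theta
lp_t = [-0.8, -1.5, -0.4]   # teacher pi*
lp_r = [-1.2, -2.5, -0.6]   # reference pi_ref

a_opd = gopd_advantage(lp_s, lp_t, lp_r, lam=1.0)   # standard OPD advantage
a_ex  = gopd_advantage(lp_s, lp_t, lp_r, lam=1.25)  # ExOPD advantage
# In a REINFORCE-style surrogate loss, each token's grad-log-prob would
# then be weighted by A_t, with a stop-gradient on the advantage itself.
```

In a deep-learning framework the same formula would be computed on log-prob tensors, with the advantage detached from the graph so gradients flow only through $\log \pi_{\theta}$.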

Experiment

  • Single-teacher distillation shows that standard OPD fully recovers teacher behavior, while reward interpolation (0 < λ < 1) enables controlled trade-offs between performance and response length; reward extrapolation (λ = 1.25) consistently surpasses the teacher, though λ = 1.5 risks instability due to reward hacking.
  • In multi-teacher distillation, ExOPD with λ = 1.25 outperforms both OPD and SFT, producing a unified student that exceeds all domain teachers—unlike weight extrapolation (ExPO), which lacks consistent gains and controllability.
  • In strong-to-weak distillation, ExOPD significantly outperforms standard OPD and SFT, demonstrating that reward extrapolation can overcome knowledge gaps between large and small models.
  • Reward correction—using the teacher’s pre-RL variant as reference—further boosts ExOPD performance, though it incurs higher computational cost and requires access to additional model variants.
  • Across settings, ExOPD consistently increases response length and entropy, indicating greater output diversity, while maintaining or exceeding teacher-level accuracy.

The authors use ExOPD to distill knowledge from stronger teacher models into smaller students in a strong-to-weak setting, achieving consistent improvements over standard OPD and SFT across multiple math reasoning benchmarks. Results show that ExOPD not only outperforms baseline methods but also scales effectively with student model size, delivering larger gains for smaller students. The method demonstrates robustness to model capacity gaps, suggesting reward extrapolation can push beyond standard distillation limits even when teachers and students differ significantly in scale.

The authors use ExOPD with a reward scaling factor of 1.25 to distill knowledge from domain-specific teachers into a base model, achieving consistent performance gains over standard OPD and the original teachers across both math reasoning and code generation tasks. In multi-teacher settings, ExOPD is the only method that produces a unified student surpassing all individual domain teachers, while SFT and ExPO show limited or inconsistent improvements. Results also confirm that ExOPD’s gains are not due to insufficient teacher training, as continued RL on teachers yields smaller improvements than ExOPD with fewer steps.


