HyperAIHyperAI

Command Palette

Search for a command to run...

자기-편자에 의한 강화 학습

초록

대규모 언어 모델은 코드와 수학과 같이 검증 가능한 도메인에서 점점 더 강화 학습을 통해 사후 훈련되고 있다. 그러나 현재의 검증 가능한 보상과 함께하는 강화 학습(RLVR) 기법은 각 시도에 대해 스칼라 형태의 결과 보상만을 이용하여 심각한 보상 할당 문제(credit-assignment bottleneck)를 겪고 있다. 실제로 많은 검증 가능한 환경은 시도가 실패한 이유를 설명하는 풍부한 텍스트 피드백(예: 런타임 오류 또는 평가자 평가)을 제공한다. 본 연구는 이러한 상황을 풍부한 피드백을 갖춘 강화 학습으로 체계화하고, 외부 교사나 명시적인 보상 모델 없이 토큰화된 피드백을 밀도 있는 학습 신호로 변환하는 자기-분산 정책 최적화(Self-Distillation Policy Optimization, SDPO)를 제안한다. SDPO는 피드백을 조건으로 한 현재 모델을 자기-교사(self-teacher)로 간주하고, 이 모델이 피드백을 반영한 다음 토큰 예측을 정책에 다시 분산시킨다. 이를 통해 SDPO는 모델이 맥락 내에서 자신의 실수를 후행적으로 식별할 수 있는 능력을 활용한다. LiveCodeBench v6에서 과학적 추론, 도구 사용, 경쟁 프로그래밍 등 다양한 과제에 대해 SDPO는 강력한 RLVR 기준 모델보다 샘플 효율성과 최종 정확도에서 우수한 성능을 보였다. 특히, 스칼라 피드백만을 반환하는 전통적인 RLVR 환경에서도, 성공적인 롤아웃을 실패한 시도에 대한 암시적 피드백으로 활용함으로써 기존 기준 모델을 능가하는 성능을 달성하였다. 마지막으로, 테스트 시 각 문제에 대해 SDPO를 개별적으로 적용하면, 어려운 이진 보상 과제에서 탐색 속도가 가속화되며, 최적의 k개 샘플링 또는 다단계 대화와 동일한 탐색 확률을 3배 적은 시도 횟수로 달성할 수 있다.

One-sentence Summary

Researchers from ETH Zurich, Max Planck, MIT, and Stanford propose SDPO, a self-distillation method that turns rich textual feedback into dense training signals for LLMs, improving sample efficiency and accuracy in code and math tasks by enabling models to learn from their own in-context error analysis without external teachers.

Key Contributions

  • SDPO introduces a novel reinforcement learning framework for verifiable domains that leverages rich textual feedback—like runtime errors or judge comments—to overcome the credit-assignment bottleneck of scalar reward-only methods, formalizing this as Reinforcement Learning with Rich Feedback (RLRF).
  • The method uses the current model as a self-teacher: after receiving feedback, it re-evaluates its own rollout token-by-token and distills its feedback-conditioned next-token predictions back into the policy, enabling dense, logit-level learning without external teachers or reward models.
  • Evaluated on LiveCodeBench v6 across scientific reasoning, tool use, and competitive programming, SDPO improves sample efficiency and final accuracy over RLVR baselines, and even boosts performance in scalar-reward settings by repurposing successful rollouts as implicit feedback, while also reducing test-time attempts by 3× compared to best-of-k sampling.

Introduction

The authors leverage reinforcement learning with rich feedback—where environments provide tokenized explanations like runtime errors or judge comments—to overcome the credit assignment bottleneck of scalar reward methods like GRPO. Prior approaches either rely on sparse rewards or require external teachers for dense supervision, limiting scalability and practicality in online learning. Their main contribution is Self-Distillation Policy Optimization (SDPO), which uses the model itself as a self-teacher: after receiving feedback, it re-evaluates its own rollout in-context and distills corrected next-token predictions back into the policy. This enables dense, logit-level credit assignment without external supervision, improving sample efficiency and final accuracy across reasoning and coding tasks—even in scalar-reward settings—while being implementable as a drop-in replacement in existing RLVR pipelines.

Method

The authors leverage a self-distillation framework, Self-Distillation Policy Optimization (SDPO), which repurposes the same policy model to serve dual roles: as a student generating initial responses and as a self-teacher that retrospectively evaluates those responses using rich feedback. This architecture enables dense, token-level credit assignment by comparing the student’s per-token probability distribution against the teacher’s distribution, conditioned on both the original question and the feedback received.

The core mechanism begins with sampling a question xxx and generating a rollout yπθ(x)y \sim \pi_{\theta}(\cdot \mid x)yπθ(x) from the student policy. The environment then provides feedback fff, which may include runtime errors, sample solutions from prior rollouts, or task-specific instructions. The self-teacher, defined as πθ(x,f)\pi_{\theta}(\cdot \mid x, f)πθ(x,f), is prompted with the same question augmented by this feedback, allowing it to reassess the student’s output with additional context. As shown in the framework diagram, this creates a closed-loop training signal where the teacher’s evaluation guides the student’s parameter updates.

The training objective minimizes the KL-divergence between the student’s next-token distribution and the teacher’s distribution at each timestep ttt, with the teacher’s parameters detached from the gradient computation to prevent degeneration. The loss is formulated as:

LSDPO(θ):=tKL(πθ(x,y<t)stopgrad(πθ(x,f,y<t)))\mathcal{L}_{\mathrm{SDPO}}(\theta) := \sum_{t} \mathrm{KL}( \pi_{\theta}(\cdot \mid x, y_{<t}) \,||\, \mathrm{stopgrad}( \pi_{\theta}(\cdot \mid x, f, y_{<t}) ) )LSDPO(θ):=tKL(πθ(x,y<t)∣∣stopgrad(πθ(x,f,y<t)))

This gradient is derived via the score function estimator and corresponds to a logit-level policy gradient where the advantage for each token y^t\hat{y}_ty^t is given by logπθ(y^tx,y<t)πθ(y^tx,f,y<t)\log \frac{\pi_{\theta}(\hat{y}_t \mid x, y_{<t})}{\pi_{\theta}(\hat{y}_t \mid x, f, y_{<t})}logπθ(y^tx,f,y<t)πθ(y^tx,y<t). Positive advantages indicate tokens the teacher deems more likely than the student, while negative advantages flag tokens where the student overconfidently diverged from the teacher’s refined judgment.

To manage computational cost, the authors approximate the full KL divergence by restricting distillation to the top-K tokens predicted by the student at each position, appending a tail term to account for the remaining probability mass. This reduces memory overhead without sacrificing performance, as most vocabulary tokens are irrelevant at any given timestep.

Training stability is enhanced through two key modifications. First, the self-teacher is regularized either via an exponential moving average (EMA) of the student’s parameters or by interpolating the current teacher with an initial reference teacher, effectively enforcing a trust region in distribution space. Second, the symmetric Jensen-Shannon divergence is adopted for the distillation loss, which has been shown to improve convergence in similar distillation settings.

The SDPO gradient can be extended to off-policy data using PPO-style clipped importance sampling, where the per-token advantage is weighted by the ratio of old and new policy probabilities. This generalization allows SDPO to leverage experience replay while maintaining the dense, feedback-driven credit assignment that distinguishes it from scalar-reward methods like GRPO.

As illustrated in the training pipeline, SDPO iteratively refines the policy: each rollout generates feedback, which is used to compute teacher log-probabilities, and the resulting per-token advantages drive the parameter update. This process repeats, with the teacher improving alongside the student, enabling the model to progressively internalize the feedback signal and correct its own errors at a granular level.

Experiment

  • SDPO outperforms GRPO in reasoning and coding tasks, achieving higher accuracy with significantly shorter response lengths, indicating more efficient reasoning.
  • In environments without rich feedback, SDPO uses successful attempts within a batch as implicit feedback, enabling faster learning and up to 10× speedup in training time on some tasks.
  • With rich feedback (e.g., coding environments), SDPO leverages dense credit assignment to improve final accuracy and reaches GRPO’s performance in 4× fewer generations, with gains amplifying as model scale increases.
  • SDPO enables test-time self-distillation, accelerating solution discovery for hard binary-reward questions by 3× compared to sampling or multi-turn baselines, even when initial success rates are near zero.
  • The self-teacher in SDPO improves during training, allowing bootstrapping from weak to strong performance, and regularization stabilizes learning without freezing the teacher.
  • SDPO avoids catastrophic forgetting better than off-policy baselines, maintaining prior capabilities while learning new tasks.
  • SDPO’s efficiency stems from sparse, token-level credit assignment that eliminates verbose, circular reasoning patterns common in GRPO.
  • Performance is tightly coupled with model scale—stronger models enable more accurate retrospection, making SDPO’s gains emergent with scale.
  • Environment feedback and sample solutions are complementary; excluding the student’s original attempt during reprompting improves exploration and diversity.

The authors use SDPO to train models on reasoning tasks without rich environment feedback, treating successful attempts within a batch as feedback for failed ones. Results show SDPO consistently outperforms GRPO across subjects and model sizes, achieving higher accuracy in less time—sometimes matching GRPO’s 5-hour performance in under an hour—and producing significantly shorter, more efficient responses. This improvement holds even when both methods use on-policy training, suggesting SDPO’s self-distillation mechanism enhances learning efficiency without requiring off-policy updates.

The authors use SDPO to train models on reasoning tasks without rich environment feedback, treating successful attempts within a batch as feedback for failed ones. Results show SDPO consistently outperforms GRPO across subjects and model sizes, achieving higher accuracy in less time and with significantly shorter response lengths, indicating more efficient reasoning. This improvement is especially pronounced on Chemistry, where SDPO reaches GRPO’s 5-hour performance in under an hour.

The authors use SDPO to significantly reduce response lengths compared to GRPO while maintaining or improving accuracy, with both Qwen3-8B and Olmo3-7B-Instruct showing a consistent 3.2× reduction in generation length. This efficiency gain reflects SDPO’s ability to produce concise, non-repetitive reasoning by leveraging dense token-level credit assignment rather than relying on verbose exploration. Results confirm that effective reasoning does not require longer outputs, and SDPO’s self-teaching mechanism enables more focused, accurate generation.

The authors use SDPO to improve performance on coding tasks with rich feedback, achieving higher final accuracy than GRPO and other baselines. Results show SDPO not only reaches GRPO’s final accuracy in fewer generations but also scales more effectively with larger models, indicating that self-teaching becomes more powerful as the base model’s in-context learning ability improves.

The authors use SDPO to evaluate how different types of feedback influence self-teaching during training, finding that combining environment output and sample solutions yields the highest student accuracy and lowest entropy. Results show that while initial teacher performance varies by feedback type, the student consistently improves beyond the teacher’s starting level, with the most effective feedback enabling the strongest gains. The self-teacher’s ability to refine its own outputs during training supports iterative policy improvement even when initial accuracy is low.


AI로 AI 구축

아이디어에서 출시까지 — 무료 AI 코코딩, 즉시 사용 가능한 환경, 최적의 GPU 가격으로 AI 개발을 가속화하세요.

AI 협업 코딩
바로 사용 가능한 GPU
최적의 가격

HyperAI Newsletters

최신 정보 구독하기
한국 시간 매주 월요일 오전 9시 에 이번 주의 최신 업데이트를 메일로 발송합니다
이메일 서비스 제공: MailChimp