Self-Distilled RLVR

Chenxu Yang Chuanyu Qin Qingyi Si Minghui Chen Naibin Gu Dingyu Yao Zheng Lin Weiping Wang Jiaqi Wang Nan Duan

Abstract

On-policy distillation (OPD) has become a widely adopted training paradigm in the LLM community. It uses a larger model as the teacher to provide dense, fine-grained signals for each sampled trajectory, whereas Reinforcement Learning with Verifiable Rewards (RLVR) relies only on sparse signals obtained from verifiable outcomes in the environment. Recently, the community has explored On-Policy Self-Distillation (OPSD), in which the same model serves as both teacher and student, with the teacher receiving additional privileged information, such as reference answers, to enable self-evolution. This paper shows that learning signals derived solely from the privileged teacher cause severe information leakage and unstable long-term training. Accordingly, we identify the optimal niche for self-distillation and propose RLSD (RLVR with Self-Distillation). Specifically, RLSD uses self-distillation to obtain token-level policy differences that determine fine-grained update magnitudes, while continuing to use RLVR to derive reliable update directions from environmental feedback (e.g., response correctness). As a result, RLSD harnesses the strengths of both RLVR and OPSD, achieving a higher convergence ceiling and superior training stability.

One-sentence Summary

Researchers from the Chinese Academy of Sciences and JD.COM propose RLSD, a novel training paradigm that combines RLVR with self-distillation to determine fine-grained update magnitudes while maintaining reliable directions from environmental feedback. This approach overcomes information leakage in prior methods, achieving superior stability and faster convergence for LLM post-training.

Key Contributions

  • The paper introduces RLSD, a training paradigm that combines RLVR with self-distillation by using environmental feedback to determine reliable update directions while leveraging token-level policy differences from a privileged teacher to modulate update magnitudes.
  • This work provides a theoretical proof that information asymmetry in on-policy self-distillation creates an irreducible mutual information gap, explaining why relying solely on privileged teacher signals leads to information leakage and unstable long-term training.
  • Experimental results on reasoning tasks demonstrate that the proposed method achieves a higher convergence ceiling and superior stability compared to standard RLVR, reaching performance levels that surpass baselines trained for twice as many steps.

Introduction

Large reasoning models increasingly rely on Reinforcement Learning with Verifiable Rewards (RLVR) to optimize against checkable outcomes, yet this approach suffers from sparse sequence-level signals that fail to distinguish critical reasoning steps from filler tokens. While On-Policy Self-Distillation (OPSD) attempts to solve this by using a model's own privileged outputs as dense training signals, it introduces a fatal information asymmetry where the student learns to leak reference answers it cannot access during inference, causing performance to degrade after initial gains. The authors address this by proposing RLSD, a paradigm that decouples update direction from update magnitude by anchoring gradient directions to reliable environment rewards while using self-distillation solely to modulate the fine-grained intensity of token-level updates.

Dataset

  • Dataset Composition and Sources: The authors train their models on MMFineReason-123K, a challenging subset derived from the larger MMFineReason-1.8M corpus. This dataset focuses on multimodal reasoning problems that require both visual perception and domain knowledge.

  • Key Details and Filtering Rules: The subset was created using a difficulty-based filtering strategy. The authors performed inference on every sample in the original corpus using Qwen3-VL-4B-Thinking with four independent rollouts. They retained only the samples where the model failed on all four attempts. This conservative approach discards trivial examples to concentrate the training signal on difficult problems.

  • Usage in Training: The filtered dataset serves as the primary training data for the Qwen3-VL-8B-Instruct base model. The training setup uses a batch size of 256 with 8 rollouts sampled per prompt at a temperature of 1.0. The maximum context length is set to 8192, split evenly between a 4096 token prompt and a 4096 token response.

  • Processing and Privileged Information: Unlike other methods that require verified reasoning traces or successful rollouts as privileged context, this approach requires only the final ground-truth answer. The teacher model parameters are synchronized with the student model every 10 training steps to maintain a stable self-distillation signal. The evaluation phase utilizes five distinct benchmarks including MMMU, MathVista, MathVision, ZeroBench, and WeMath to assess performance across diverse mathematical and general reasoning capabilities.
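The difficulty-based filtering rule described above (keep a sample only if all four rollouts fail) can be sketched as follows. This is a minimal illustration, not the authors' pipeline: `filter_hard_samples`, `rollout_is_correct`, and the `solvable` flag in the demo data are hypothetical names, and the real procedure would call Qwen3-VL-4B-Thinking plus an answer verifier where the stand-in callback is used here.

```python
from typing import Callable, Iterable, List, TypeVar

Sample = TypeVar("Sample")


def filter_hard_samples(
    samples: Iterable[Sample],
    rollout_is_correct: Callable[[Sample], bool],
    num_rollouts: int = 4,
) -> List[Sample]:
    """Keep only the samples on which every one of `num_rollouts` attempts fails.

    `rollout_is_correct` is a hypothetical callback: one call stands for one
    independent rollout followed by a correctness check.
    """
    hard: List[Sample] = []
    for sample in samples:
        # any() stops at the first successful rollout, so solved samples are cheap.
        solved = any(rollout_is_correct(sample) for _ in range(num_rollouts))
        if not solved:
            hard.append(sample)
    return hard


# Deterministic stand-in for the model: a sample is "solved" iff its flag is set.
demo_samples = [
    {"id": 1, "solvable": True},
    {"id": 2, "solvable": False},
    {"id": 3, "solvable": False},
]
hard_samples = filter_hard_samples(demo_samples, lambda s: s["solvable"])
hard_ids = [s["id"] for s in hard_samples]
```

With a stochastic model, raising `num_rollouts` makes the filter stricter, since a sample survives only if it is never solved.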

Method

The authors propose Reinforcement Learning with Self-Distillation (RLSD) to address the limitations of standard distribution matching approaches like On-Policy Self-Distillation (OPSD). Instead of treating the teacher model as a generative target for behavioral cloning, RLSD repurposes the discrepancy between the teacher and student distributions as a token-level credit assignment signal within a policy gradient framework. This approach allows the model to leverage privileged information (such as reference solutions) to refine the magnitude of updates without compromising the direction of optimization, which remains anchored to the environment's verifiable reward.

The core mechanism of RLSD operates through a three-step process to construct a token-level advantage $\hat{A}_t$ from a sequence-level advantage $A$. First, the method computes the privileged information gain $\Delta_t$ at each token position $t$. This is defined as the stop-gradient difference between the log-probability of the token under the teacher context (conditioned on the question $x$ and privileged information $r$) and the student context (conditioned only on $x$):

$$\Delta_t = \mathrm{sg}\big(\log P_T(y_t) - \log P_S(y_t)\big).$$

This metric isolates the marginal contribution of the privileged information to the prediction of the specific token generated by the student. A positive $\Delta_t$ indicates that the privileged information supports the token, while a negative value suggests it disfavors it.
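As a small numeric illustration of the privileged information gain, the sketch below takes per-token log-probabilities that are assumed to be already computed for the same sampled tokens under the teacher context and the student context. The function name and the example probabilities are made up for illustration. Plain floats carry no gradient, so the stop-gradient is implicit here; in an autodiff framework the result would need to be detached explicitly.

```python
import math
from typing import List, Sequence


def privileged_information_gain(
    teacher_logprobs: Sequence[float],
    student_logprobs: Sequence[float],
) -> List[float]:
    """Per-token difference log P_T(y_t) - log P_S(y_t) for the sampled tokens."""
    if len(teacher_logprobs) != len(student_logprobs):
        raise ValueError("teacher and student must score the same tokens")
    return [lt - ls for lt, ls in zip(teacher_logprobs, student_logprobs)]


# Hypothetical probabilities of three sampled tokens under each context.
teacher_probs = [0.9, 0.5, 0.1]
student_probs = [0.6, 0.5, 0.4]

delta = privileged_information_gain(
    [math.log(p) for p in teacher_probs],
    [math.log(p) for p in student_probs],
)
# delta[0] > 0: the privileged information supports the first token.
# delta[1] == 0: it is neutral about the second token.
# delta[2] < 0: it disfavors the third token.
```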

Second, the method performs direction-aware evidence reweighting. The authors construct a per-token weight $w_t$ by exponentiating the privileged information gain, modulated by the sign of the sequence-level advantage $A$:

$$w_t = \exp\big(\mathrm{sign}(A) \cdot \Delta_t\big) = \left(\frac{P_T(y_t)}{P_S(y_t)}\right)^{\mathrm{sign}(A)}.$$

This formulation ensures that the environment reward retains exclusive authority over the direction of the update (reinforcement vs. penalization), while the teacher's assessment modulates the relative magnitude of credit across tokens within a trajectory. When $A > 0$, tokens supported by the privileged information receive higher weights; when $A < 0$, tokens disfavored by the privileged information bear greater blame.
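The effect of the sign flip is easy to check numerically. The following is a minimal sketch under the definitions above; `evidence_weights` and the example values of the privileged information gain are illustrative assumptions, not taken from the paper.

```python
import math
from typing import List, Sequence


def sign(x: float) -> int:
    """Return -1, 0, or +1 according to the sign of x."""
    return (x > 0) - (x < 0)


def evidence_weights(delta: Sequence[float], advantage: float) -> List[float]:
    """Per-token weights exp(sign(A) * delta_t) for one response."""
    s = sign(advantage)
    return [math.exp(s * d) for d in delta]


# Illustrative gains: the teacher/student probability ratio is 2, 1, and 0.5.
example_delta = [math.log(2.0), 0.0, math.log(0.5)]

w_correct = evidence_weights(example_delta, advantage=+0.7)  # about [2.0, 1.0, 0.5]
w_wrong = evidence_weights(example_delta, advantage=-0.7)    # about [0.5, 1.0, 2.0]
```

For a correct response the token the teacher supports gets the largest weight; for an incorrect response the ordering is mirrored, so the token the teacher disfavors carries the most blame. Only the sign of the advantage is used here; its magnitude enters later, when the weight multiplies $A$.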

Refer to the framework diagram below to visualize how the sequence-level advantage and token-level weights are combined to produce the final token-level advantages.

Finally, to ensure training stability, the evidence weights are clipped to bound the maximum influence of any single token, similar to the trust-region constraints in PPO and GRPO. The final token-level advantage $\hat{A}_t$ is computed as:

$$\hat{A}_t = A \cdot \mathrm{clip}\big(w_t,\ 1 - \epsilon_w,\ 1 + \epsilon_w\big).$$
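A short sketch of the clipping step, assuming the weights have already been computed. The value 0.2 used for the clipping range is an illustrative assumption (this summary does not state the value used in the paper), and the function names are hypothetical.

```python
from typing import List, Sequence


def clip(x: float, low: float, high: float) -> float:
    """Clamp x to the closed interval [low, high]."""
    return max(low, min(x, high))


def token_level_advantages(
    advantage: float,
    weights: Sequence[float],
    eps_w: float,
) -> List[float]:
    """Scale the sequence-level advantage by the clipped per-token weight."""
    return [advantage * clip(w, 1.0 - eps_w, 1.0 + eps_w) for w in weights]


# Weights 2.0 and 0.5 fall outside [0.8, 1.2] and are clipped; 1.0 is unchanged.
a_hat_pos = token_level_advantages(1.0, [2.0, 1.0, 0.5], eps_w=0.2)   # about [1.2, 1.0, 0.8]
a_hat_neg = token_level_advantages(-1.0, [2.0, 1.0, 0.5], eps_w=0.2)  # about [-1.2, -1.0, -0.8]
```

Because the clipped weight is always positive, every token-level advantage keeps the sign of the sequence-level advantage, which matches the claim that the reward alone decides the update direction.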

The training process follows a standard Group Relative Policy Optimization (GRPO) pipeline with this modified advantage. For each question $x$, the policy model samples a group of $G$ responses. A verifier provides a binary reward for each response, from which the group-relative sequence-level advantage $A$ is calculated. The model then performs an additional forward pass with the privileged information $r$ to compute the teacher logits and derive the token-level weights $w_t$. The policy parameters $\theta$ are updated by maximizing the objective function using the reweighted advantages $\hat{A}_t$:

$$\mathcal{L}_{\mathrm{RLSD}}(\theta) = \mathbb{E}\left\{ \frac{1}{G} \sum_{i=1}^{G} \frac{1}{|y^{(i)}|} \sum_{t=1}^{|y^{(i)}|} \min\Big[ w_t A^{(i)},\ \mathrm{clip}(w_t,\ 1 - \epsilon_w,\ 1 + \epsilon_w)\, A^{(i)} \Big] \right\}.$$

This design allows RLSD to function as a drop-in replacement for the uniform advantage in GRPO, providing dense token-level guidance without introducing auxiliary distillation losses or requiring an external teacher model.
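To make the group-level bookkeeping concrete, the sketch below computes group-relative advantages from binary rewards and then evaluates the bracketed quantity of the objective, as printed above, for one group. Several choices are assumptions: the advantage is normalized as (reward minus group mean) divided by the group's population standard deviation plus a small constant, which is a common GRPO convention that this summary does not spell out; the clipping range 0.2 and all function names are illustrative. The code only evaluates numbers and does not model how gradients reach the policy parameters.

```python
import statistics
from typing import List, Sequence


def group_relative_advantages(rewards: Sequence[float], eps: float = 1e-6) -> List[float]:
    """Assumed GRPO-style normalization: (r - mean) / (std + eps) within one group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


def clip(x: float, low: float, high: float) -> float:
    """Clamp x to the closed interval [low, high]."""
    return max(low, min(x, high))


def rlsd_group_value(
    advantages: Sequence[float],
    weights: Sequence[Sequence[float]],
    eps_w: float,
) -> float:
    """Average over responses of the token-averaged min[w*A, clip(w)*A] term."""
    total = 0.0
    for adv, w_seq in zip(advantages, weights):
        terms = [min(w * adv, clip(w, 1.0 - eps_w, 1.0 + eps_w) * adv) for w in w_seq]
        total += sum(terms) / len(terms)  # length-normalize each response
    return total / len(advantages)        # average over the group


# A group of four responses with binary verifier rewards.
adv = group_relative_advantages([1.0, 0.0, 0.0, 1.0])  # roughly [+1, -1, -1, +1]

# Two-response toy group with hand-picked advantages and weights.
value = rlsd_group_value([1.0, -1.0], [[2.0, 0.5], [2.0, 0.5]], eps_w=0.2)
```

If every response in a group receives the same reward, all advantages are zero under this normalization, so that group contributes nothing to the update regardless of the token weights.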

Experiment

  • Empirical observations reveal that OPSD-trained models progressively leak privileged information unavailable at inference, leading to performance degradation and stagnation in KL divergence, which indicates an irreducible gap preventing meaningful convergence.
  • Main results demonstrate that RLSD outperforms baseline methods including GRPO, OPSD, and SDPO across multimodal reasoning benchmarks by leveraging dense token-level credit assignment to achieve superior accuracy on complex mathematical tasks.
  • Training dynamics analysis shows that RLSD avoids the late-stage performance collapse seen in OPSD and prevents the rapid entropy collapse of GRPO by maintaining higher entropy through selective strengthening of critical reasoning tokens.
  • Case studies confirm that RLSD effectively redistributes sequence-level rewards to the token level, assigning higher credit to decisive reasoning steps and stronger blame to specific errors while down-weighting generic narration.
