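Your Group-Relative Advantage Is Biased

Abstract

Reinforcement Learning from Verifier Rewards (RLVR) has emerged as a widely used approach for post-training large language models on reasoning tasks, and group-based methods such as GRPO and its variants have been broadly adopted. These methods rely on group-relative advantage estimation to avoid a learned critic, yet the theoretical properties of this estimator remain poorly understood. In this work, we uncover a fundamental issue in group-based reinforcement learning: the group-relative advantage estimator is inherently biased relative to the true (expected) advantage. We provide the first theoretical analysis of this bias, showing that the estimator systematically underestimates advantages for hard prompts and overestimates them for easy prompts, which leads to imbalanced exploration and exploitation. To address this issue, we propose History-Aware Adaptive Difficulty Weighting (HA-DW), which adaptively adjusts advantage estimates based on training dynamics and an evolving difficulty anchor. Theoretical analysis and experiments on five mathematical reasoning benchmarks show that HA-DW consistently improves performance when integrated into GRPO and its variants. Our results suggest that correcting biased advantage estimation is essential for robust and efficient RLVR training.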

One-sentence Summary

The authors from Beihang University, UC Berkeley, Peking University, and Meituan propose HA-DW, a history-aware adaptive difficulty weighting method that corrects the inherent bias in group-relative advantage estimation within GRPO-based reinforcement learning from verifier rewards, improving exploration-exploitation balance and boosting performance across five mathematical reasoning benchmarks.

Key Contributions

  • Group-based reinforcement learning from verifier rewards (RLVR) methods like GRPO rely on group-relative advantage estimation to avoid learned critics, but this approach suffers from an inherent bias that systematically underestimates advantages for hard prompts and overestimates them for easy prompts, leading to imbalanced exploration and exploitation.
  • The authors propose History-Aware Adaptive Difficulty Weighting (HA-DW), a novel reweighting scheme that dynamically adjusts advantage estimates using an evolving difficulty anchor informed by long-term reward trends and historical training dynamics to correct this bias.
  • Experiments on five mathematical reasoning benchmarks show that integrating HA-DW into GRPO and its variants consistently improves performance across model scales, even outperforming versions using more rollouts, demonstrating the critical impact of unbiased advantage estimation in RLVR.

Introduction

The authors investigate reinforcement learning from verifier rewards (RLVR), a key paradigm for post-training large language models on reasoning tasks, where group-based methods like GRPO dominate due to their simplicity and effectiveness. These methods estimate advantages within small groups of rollouts per prompt without requiring a separate critic, but prior work lacks a rigorous theoretical understanding of their underlying assumptions. The paper reveals a fundamental flaw: group-relative advantage estimation is systematically biased—underestimating true advantages for hard prompts and overestimating them for easy ones—leading to imbalanced exploration and exploitation that harms training stability and generalization. To address this, the authors propose History-Aware Adaptive Difficulty Weighting (HA-DW), a dynamic reweighting scheme that adjusts advantage estimates using an evolving difficulty anchor informed by long-term reward trends and historical training data. Theoretical analysis and experiments across five mathematical reasoning benchmarks show that HA-DW consistently improves performance when integrated into GRPO and its variants, even outperforming versions with more rollouts, demonstrating that correcting this bias is crucial for robust and efficient RLVR training.
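For readers unfamiliar with the estimator under discussion, the following is a minimal sketch of a GRPO-style group-relative advantage. It assumes binary verifier rewards and the common mean/standard-deviation normalization; the function name and the `eps` stabilizer are illustrative choices, not taken from the paper, and GRPO variants differ in the normalization details.

```python
import numpy as np

def group_relative_advantages(rewards, eps=1e-6):
    """Standardize each rollout's reward against its own group (no critic needed)."""
    r = np.asarray(rewards, dtype=float)
    baseline = r.mean()  # empirical group baseline; equals the group accuracy for 0/1 rewards
    return (r - baseline) / (r.std() + eps)  # eps (assumed) avoids division by zero

# One prompt, G = 8 rollouts, 2 of which the verifier marks correct.
adv = group_relative_advantages([1, 0, 0, 0, 1, 0, 0, 0])
```

Correct rollouts receive a positive advantage and incorrect ones a negative advantage, while a group whose rewards are all identical yields zero advantage for every rollout.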

Method

The proposed method, History-Aware Adaptive Difficulty Weighting (HA-DW), addresses the inherent bias in group-relative advantage estimation within reinforcement learning for language models. The framework operates in two primary phases: an evolving difficulty anchor that tracks the model's capability over time, and a history-aware reweighting mechanism that adjusts advantage estimates based on this evolving state. The overall architecture is designed to correct systematic underestimation for hard prompts and overestimation for easy prompts, which are common issues in group-relative policy optimization (GRPO) and its variants.

The core of the method is the evolving difficulty anchor, which models the model's solving capability as a latent belief state, denoted as $C_t$. This belief is updated across training batches using a Kalman-style filter. At each step $t$, the observation $y_t$, which is the batch-level accuracy (the ratio of correct responses), is used to update the prior belief $C_t^-$ to the posterior belief $C_t^+$. The update rule is $C_t^+ = (1 - \eta_t) C_t^- + \eta_t y_t$, where $\eta_t$ is a dynamic forgetting factor. This factor is modulated by the model's stability, calculated as the standard deviation of the belief over the previous $m$ batches. A larger standard deviation, indicating high instability or rapid capability shifts, results in a higher $\eta_t$, allowing the model to adapt quickly. Conversely, a smaller standard deviation, indicating a stable model, results in a lower $\eta_t$, which preserves historical information and reduces noise. This evolving belief $C_t$ serves as a history-aware anchor for the subsequent difficulty-adaptive reweighting strategy.
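The anchor update can be sketched in a few lines of Python. The update rule $C_t^+ = (1 - \eta_t) C_t^- + \eta_t y_t$ follows the description above, but the mapping from the belief's standard deviation to $\eta_t$ is not specified here, so the clipped linear rule and the parameters `eta_min`, `eta_max`, `gain`, and the initial belief `c0` are assumptions made purely for illustration.

```python
import numpy as np
from collections import deque

class DifficultyAnchor:
    """Sketch of an evolving difficulty anchor (hyperparameters are hypothetical)."""

    def __init__(self, c0=0.5, m=10, eta_min=0.05, eta_max=0.5, gain=2.0):
        self.c = c0                             # current belief C_t
        self.history = deque([c0], maxlen=m)    # beliefs over the previous m batches
        self.eta_min, self.eta_max, self.gain = eta_min, eta_max, gain

    def update(self, y_t):
        # Stability signal: standard deviation of recent beliefs.
        sigma = float(np.std(list(self.history)))
        # ASSUMED mapping: more volatility -> larger forgetting factor, clipped to a range.
        eta = float(np.clip(self.eta_min + self.gain * sigma, self.eta_min, self.eta_max))
        prior = self.c                            # C_t^-
        self.c = (1.0 - eta) * prior + eta * y_t  # C_t^+
        self.history.append(self.c)
        return prior, self.c, eta

anchor = DifficultyAnchor()
for batch_accuracy in [0.20, 0.25, 0.60, 0.65]:
    prior, posterior, eta = anchor.update(batch_accuracy)
```

Because the posterior is a convex combination of the prior and the observed batch accuracy, the belief always stays between the two, and a larger `eta` moves it closer to the latest observation.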

The second phase, history-aware adaptive difficulty weighting, uses the evolving difficulty anchor to correct the biased advantage estimates. The history-based prompt difficulty is defined as $\text{diff}_t^{\text{his}} = \hat{p}_t - C_t^-$, where $\hat{p}_t$ is the empirical group baseline. This value captures the deviation of the current prompt's difficulty from the model's current capability. The direction of the adjustment is determined by the negated product of the sign of the estimated advantage and the sign of the history-based difficulty, $D_{t,i} = -\text{sgn}(\hat{A}_{t,i}) \cdot \text{sgn}(\text{diff}_t^{\text{his}})$. This ensures that the reweighting amplifies the advantage for difficult prompts (where $\hat{A}_{t,i}$ is likely underestimated) and suppresses it for easy prompts (where $\hat{A}_{t,i}$ is likely overestimated). The magnitude of the adjustment is quantified by the absolute history-based difficulty, $M_t = |\text{diff}_t^{\text{his}}|$. The final history-aware reweighting factor is defined as $\Phi_{t,i} = \lambda_{\text{scale}} \cdot \exp(D_{t,i} \cdot M_t)$, which is a smooth, multiplicative factor applied to the advantage term in the policy objective. This reweighted objective, $L_{\text{HA-DW}}(\theta)$, is then used for policy updates, effectively mitigating the bias identified in the theoretical analysis.
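As a rough illustration of the reweighting factor, the sketch below evaluates $\Phi_{t,i} = \lambda_{\text{scale}} \cdot \exp(D_{t,i} \cdot M_t)$ for one prompt. The function name and the example numbers are hypothetical, and `lam_scale` is set to 1.0 here only to make the direction of the adjustment easy to read; how the factor enters the full policy objective is not shown.

```python
import numpy as np

def ha_dw_factor(advantages, p_hat, c_prior, lam_scale=1.0):
    """Per-rollout reweighting factor Phi_{t,i} for a single prompt (sketch)."""
    a_hat = np.asarray(advantages, dtype=float)
    diff_his = p_hat - c_prior                    # history-based difficulty
    d = -np.sign(a_hat) * np.sign(diff_his)       # direction D_{t,i}
    m = abs(diff_his)                             # magnitude M_t
    return lam_scale * np.exp(d * m)

# Hard prompt: group accuracy 0.25 is below the anchor 0.60.
# First rollout is correct (positive advantage), second is incorrect.
phi_hard = ha_dw_factor([1.7, -0.6], p_hat=0.25, c_prior=0.60)

# Easy prompt: group accuracy 0.90 is above the anchor 0.60.
phi_easy = ha_dw_factor([0.3, -3.0], p_hat=0.90, c_prior=0.60)
```

On the hard prompt the correct rollout's factor is above 1 and the incorrect rollout's is below 1; on the easy prompt the pattern reverses, which matches the amplification and suppression behavior described above.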

Experiment

  • Evaluated HA-DW on Qwen3-4B-Base, Qwen3-8B-Base, and LLaMA-3.2-3B-Instruct across five RLVR benchmarks using GRPO, GSPO, and DAPO, demonstrating consistent performance gains over original group-relative methods.
  • On MATH500, GRPO+HA-DW achieved 3.4% higher accuracy on Hard-level prompts compared to GRPO, validating improved exploration on challenging tasks.
  • Training dynamics show HA-DW leads to higher accuracy plateaus and increased training rewards, with longer reasoning chains, indicating enhanced reasoning ability.
  • Ablation on the dynamic threshold $C_t$ confirms its superiority over fixed thresholds, with performance degradation when removed, highlighting its role in capturing long-term reward signals.
  • Empirical analysis on MATH and DAPO-Math-17k reveals that the advantage of correct responses is underestimated when few rollouts are used (8 per prompt), confirming biased advantage estimation on hard prompts.
  • Theoretical analysis proves systematic bias in group-relative advantage estimation: overestimation for easy prompts ($p_t > 0.75$) and underestimation for hard prompts ($p_t < 0.25$), with bias probability exceeding 78% under $G \in [2,8]$.
  • Extension to non-binary rewards (Beta and truncated Gaussian) confirms similar bias patterns, with magnitude increasing as prompt difficulty deviates from 0.5.
  • Ablation on group size $G$ shows HA-DW outperforms larger rollouts in low-sample settings, offering a computationally efficient alternative to scaling rollouts.
  • Ablation on $\lambda_{\text{scale}}$ identifies optimal values (1.3–1.5) that balance adjustment across difficulty levels, maximizing performance.

The authors use the table to quantify the probability of biased advantage estimation in group-relative reinforcement learning for hard prompts, showing that as the group size $G$ increases, the probability of overestimating the baseline decreases. Results show that for $G=2$, the probability exceeds 0.999, but drops to 0.781 when $G=8$, indicating that larger group sizes reduce the likelihood of bias.
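One way to see where such a bias can come from is an exact calculation under two assumptions that are not stated on this page: rewards are binary with a fixed success probability $p$, and groups whose rewards are all identical are ignored because they yield zero advantage. The sketch below computes, for a single $p$, the probability that the group baseline $\hat{p}$ exceeds $p$ and the conditional mean of $\hat{p}$. It is not expected to reproduce the table's values, whose exact definition is not given here.

```python
from math import comb

def binom_pmf(k, G, p):
    """Probability of exactly k correct rollouts out of G."""
    return comb(G, k) * p**k * (1 - p) ** (G - k)

def conditional_baseline_stats(G, p):
    """Return (P(p_hat > p | mixed group), E[p_hat | mixed group])."""
    ks = range(1, G)  # mixed groups: at least one correct and one incorrect rollout
    mass = sum(binom_pmf(k, G, p) for k in ks)
    prob_over = sum(binom_pmf(k, G, p) for k in ks if k / G > p) / mass
    mean_p_hat = sum((k / G) * binom_pmf(k, G, p) for k in ks) / mass
    return prob_over, mean_p_hat

for G in (2, 4, 8):
    prob_over, mean_p_hat = conditional_baseline_stats(G, p=0.2)  # a hard prompt
    print(f"G={G}: P(p_hat > p)={prob_over:.3f}, E[p_hat]={mean_p_hat:.3f}")
```

Under these assumptions the conditional mean of $\hat{p}$ lies above $p$ for hard prompts and below it for easy ones, so the advantage of a correct response, which is measured against $\hat{p}$, is pulled down on hard prompts and up on easy prompts.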

The authors use the table to analyze the probability of biased advantage estimation in group-relative reinforcement learning algorithms as the group size $G$ increases. Results show that the probability $\mathbb{P}(G, 0, 0.5)$ decreases significantly with larger $G$, indicating that the bias in advantage estimation becomes less likely as the group size grows.

The authors use a consistent set of hyperparameters across all experiments, with the primary difference being the application of HA-DW to GRPO, GSPO, and DAPO. The table shows that the only modifications introduced by HA-DW are in the learning rate and clipping thresholds, while all other settings remain identical, ensuring a fair comparison.

The authors use the table to quantify the probability of biased advantage estimation in group-relative reinforcement learning algorithms, that is, the probability of underestimating the advantage for hard prompts and overestimating it for easy prompts, as a function of the group size $G$. Results show that the probability is still 0.781 for $G=6$, indicating that the estimation bias remains substantial at moderate group sizes.

Results show that applying HA-DW to group-relative reinforcement learning algorithms improves performance across five benchmarks, with models trained using HA-DW achieving higher accuracy and rewards compared to the original methods. The training dynamics indicate that HA-DW enhances exploration on hard prompts and encourages longer reasoning chains, leading to better overall performance.

