HyperAIHyperAI

Command Palette

Search for a command to run...

VESPO: 안정적인 오프폴리시 LLM 훈련을 위한 변분 시퀀스 수준 소프트 정책 최적화

Guobin Shen Chenxiao Zhao Xiang Cheng Lei Huang Xing Yu

초록

대규모 언어 모델(LLM)을 위한 강화학습(RL)에서 훈련 안정성은 여전히 핵심적인 과제로 남아 있다. 정책의 노후화(policy staleness), 비동기적 훈련, 그리고 훈련과 추론 엔진 간의 불일치는 현재 정책으로부터 행동 정책이 벗어나게 하여 훈련 붕괴를 초래할 위험이 있다. 중요도 샘플링(importance sampling)은 이러한 분포 변화에 대해 원칙적인 보정을 제공하지만, 높은 분산(variance) 문제를 겪는다. 기존의 해결책인 토큰 수준 클리핑과 시퀀스 수준 정규화는 통합된 이론적 기반을 결여하고 있다. 본 연구에서는 변분적 시퀀스 수준 소프트 정책 최적화(Variational sEquence-level Soft Policy Optimization, VESPO)를 제안한다. 제안하는 방법은 제안 분포에 대한 변분 공식화에 분산 감소 기법을 통합하여, 길이 정규화 없이 시퀀스 수준의 중요도 가중치에 직접 작용하는 폐형 해를 갖는 리셰이핑 커널을 도출한다. 수학적 추론 벤치마크에서의 실험 결과, VESPO는 최대 64배의 노후화 비율과 완전한 비동기 실행 환경에서도 안정적인 훈련을 유지하며, 밀도 높은 모델과 Mixture-of-Experts 모델 모두에서 일관된 성능 향상을 보였다. 코드는 https://github.com/FloyedShen/VESPO 에서 공개되어 있다.

One-sentence Summary

Researchers from a single institution propose VESPO, a variational method that stabilizes RL training for LLMs by reducing variance in importance sampling via a closed-form sequence-level kernel, outperforming prior approaches under extreme staleness and asynchronous conditions across dense and MoE models.

Key Contributions

  • VESPO introduces a variational formulation that explicitly reduces variance in off-policy RL for LLMs, deriving a closed-form reshaping kernel for sequence-level importance weights without relying on length normalization or heuristic clipping.
  • The method preserves token dependencies and avoids bias from normalization, enabling stable training under extreme staleness (up to 64×) and fully asynchronous execution, with compatibility across both dense and Mixture-of-Experts architectures.
  • Evaluated on mathematical reasoning benchmarks, VESPO consistently improves performance and training stability compared to existing token- and sequence-level IS baselines, offering a theoretically grounded alternative to ad hoc weight transformations.

Introduction

The authors leverage reinforcement learning to stabilize off-policy training for large language models, where policy staleness and asynchronous updates cause distribution shifts that risk training collapse. Prior methods rely on heuristic token-level clipping or sequence-level normalization to control importance sampling variance, but these introduce bias or fail to preserve sequence structure. VESPO introduces a variational framework that derives a closed-form reshaping kernel operating directly on sequence-level weights, reducing variance without length normalization or approximation. It enables stable training under extreme staleness and asynchronous conditions, delivering consistent gains across both dense and MoE architectures.

Method

The authors leverage a variational framework to derive a principled importance weight transformation for off-policy policy gradient optimization in autoregressive language models. Their method, VESPO (Variational Sequence-Level Soft Policy Optimization), reinterprets any reshaping function φ(W) as inducing an implicit proposal distribution Q that mediates between the behavior policy μ and the target policy π, while enforcing variance constraints.

Refer to the framework diagram, which visualizes the variational objective: the proposal Q* is optimized to lie in a region that balances proximity to both μ and π, subject to a constraint on the expected importance weight under Q. The dual KL objective (1α)DKL(Qμ)+αDKL(Qπ)(1 - \alpha) D_{\mathrm{KL}}(Q \| \mu) + \alpha D_{\mathrm{KL}}(Q \| \pi)(1α)DKL(Qμ)+αDKL(Qπ) pulls Q toward the sampling distribution μ for efficiency and toward the target π for bias reduction. The constraint EQ[W]C\mathbb{E}_Q[W] \leq CEQ[W]C ensures the variance of the estimator remains bounded, preventing the explosive growth typical of uncorrected sequence-level importance sampling.

The closed-form solution for the optimal proposal yields the reshaping function ϕ(W)=Wαexp(λW)\phi(W) = W^{\alpha} \exp(-\lambda W)ϕ(W)=Wαexp(λW), which combines a power-law scaling with exponential suppression to smoothly attenuate extreme weights. In practice, the authors adopt a shifted variant ϕ(W)=Wc1exp(c2(1W))\phi(W) = W^{c_1} \exp(c_2(1 - W))ϕ(W)=Wc1exp(c2(1W)) to ensure ϕ(1)=1\phi(1) = 1ϕ(1)=1, preserving the scale of on-policy updates. This kernel is applied asymmetrically: distinct hyperparameters (c1,c2)(c_1, c_2)(c1,c2) are used for positive and negative advantages, allowing stronger suppression of low-weight samples when the advantage is negative, which helps prevent over-penalization of trajectories the policy already disfavors.

The gradient estimator is implemented in REINFORCE style, with the reshaping weight detached from the computation graph to serve purely as a scaling factor. To ensure numerical stability, all computations—including the sequence-level log-ratio logW=t(logπθ(ytx,y<t)logμ(ytx,y<t))\log W = \sum_t (\log \pi_\theta(y_t|x,y_{<t}) - \log \mu(y_t|x,y_{<t}))logW=t(logπθ(ytx,y<t)logμ(ytx,y<t)) and the log-transformed reshaping term c1logW+c2(1W)c_1 \log W + c_2(1 - W)c1logW+c2(1W)—are performed in log-space, with exponentiation deferred until the final step. This avoids overflow from extreme importance weights while requiring no additional memory beyond storing per-token log-probabilities under both policies.

Experiment

  • VESPO consistently outperforms baselines across multiple model scales and mathematical reasoning benchmarks, particularly excelling on MoE models where train-inference mismatch and policy staleness are amplified.
  • It demonstrates exceptional robustness to policy staleness across varying degrees of off-policy updates (N=4 to 64), maintaining stable training dynamics and high accuracy where other methods degrade or collapse.
  • Under fully asynchronous training, VESPO sustains stable convergence while baselines exhibit instability, reward collapse, or suboptimal performance due to stale rollouts.
  • VESPO tolerates train-inference mismatch without specialized fixes, matching or exceeding the stability of methods augmented with engineering solutions like Routing Replay or Truncated Importance Sampling.
  • Ablations confirm that VESPO’s sequence-level design without length normalization avoids length bias and training collapse, unlike normalized variants.
  • Its asymmetric hyperparameter design for positive and negative advantages balances suppression and learning signal, proving critical for stability and performance.

The authors evaluate VESPO under varying degrees of policy staleness and find it consistently outperforms baselines across all staleness levels, maintaining stable training and strong downstream accuracy even at extreme settings like N=64. Unlike GRPO, GSPO, and SAPO—which degrade or collapse under higher staleness—VESPO’s design avoids length bias and stabilizes updates through asymmetric importance weight suppression. Results confirm VESPO’s robustness to distribution shifts from both policy staleness and train-inference mismatch, particularly benefiting MoE models where such shifts are amplified.

The authors evaluate VESPO against several baselines across multiple model scales and mathematical reasoning benchmarks, finding that VESPO consistently achieves the highest average accuracy in all settings. Results show that VESPO’s performance advantage is most pronounced in larger MoE models, where it mitigates instability caused by distribution shifts from policy staleness and train-inference mismatch. The method’s stability and effectiveness stem from its sequence-level soft suppression of extreme importance weights and asymmetric hyperparameter design, which prevent training collapse and maintain consistent learning dynamics.


AI로 AI 구축

아이디어에서 출시까지 — 무료 AI 코코딩, 즉시 사용 가능한 환경, 최적의 GPU 가격으로 AI 개발을 가속화하세요.

AI 협업 코딩
바로 사용 가능한 GPU
최적의 가격

HyperAI Newsletters

최신 정보 구독하기
한국 시간 매주 월요일 오전 9시 에 이번 주의 최신 업데이트를 메일로 발송합니다
이메일 서비스 제공: MailChimp
VESPO: 안정적인 오프폴리시 LLM 훈련을 위한 변분 시퀀스 수준 소프트 정책 최적화 | 문서 | HyperAI초신경