HyperAI

VESPO: Variational Sequence-Level Soft Policy Optimization for Stable Off-Policy LLM Training

Guobin Shen, Chenxiao Zhao, Xiang Cheng, Lei Huang, Xing Yu

Abstract

Training stability remains a central challenge in reinforcement learning (RL) for large language models (LLMs). Stale rollouts, asynchronous training, and mismatches between training and inference engines all cause the behavior policy to drift from the current policy, exposing training to collapse. Importance sampling provides a principled correction for this distribution shift but suffers from high variance, and existing remedies such as token-level clipping and sequence-level normalization corrections lack a unified theoretical foundation. We propose Variational sEquence-level Soft Policy Optimization (VESPO). By embedding variance reduction in a variational formulation over proposal distributions, VESPO derives a closed-form reshaping kernel that acts directly on sequence-level importance weights without length normalization. Experiments on mathematical reasoning benchmarks show that VESPO remains stable even at staleness ratios up to 64× and under fully asynchronous execution, and delivers consistent gains on both dense and Mixture-of-Experts models. Code is available at: https://github.com/FloyedShen/VESPO

One-sentence Summary

The authors propose VESPO, a variational method that stabilizes off-policy RL training for LLMs by reducing importance-sampling variance through a closed-form sequence-level reshaping kernel, outperforming prior approaches under extreme staleness and fully asynchronous conditions across both dense and MoE models.

Key Contributions

  • VESPO introduces a variational formulation that explicitly reduces variance in off-policy RL for LLMs, deriving a closed-form reshaping kernel for sequence-level importance weights without relying on length normalization or heuristic clipping.
  • The method preserves token dependencies and avoids bias from normalization, enabling stable training under extreme staleness (up to 64×) and fully asynchronous execution, with compatibility across both dense and Mixture-of-Experts architectures.
  • Evaluated on mathematical reasoning benchmarks, VESPO consistently improves performance and training stability compared to existing token- and sequence-level IS baselines, offering a theoretically grounded alternative to ad hoc weight transformations.

Introduction

The authors address off-policy reinforcement learning for large language models, where policy staleness and asynchronous updates cause distribution shifts that risk training collapse. Prior methods rely on heuristic token-level clipping or sequence-level normalization to control importance-sampling variance, but these introduce bias or fail to preserve sequence structure. VESPO introduces a variational framework that derives a closed-form reshaping kernel operating directly on sequence-level weights, reducing variance without length normalization or heuristic clipping. This enables stable training under extreme staleness and asynchronous conditions, delivering consistent gains across both dense and MoE architectures.

Method

The authors leverage a variational framework to derive a principled importance weight transformation for off-policy policy gradient optimization in autoregressive language models. Their method, VESPO (Variational Sequence-Level Soft Policy Optimization), reinterprets any reshaping function φ(W) as inducing an implicit proposal distribution Q that mediates between the behavior policy μ and the target policy π, while enforcing variance constraints.

Refer to the framework diagram, which visualizes the variational objective: the proposal Q* is optimized to lie in a region that balances proximity to both μ and π, subject to a constraint on the expected importance weight under Q. The dual KL objective $(1 - \alpha)\, D_{\mathrm{KL}}(Q \| \mu) + \alpha\, D_{\mathrm{KL}}(Q \| \pi)$ pulls Q toward the sampling distribution μ for efficiency and toward the target π for bias reduction. The constraint $\mathbb{E}_Q[W] \leq C$ keeps the variance of the estimator bounded, preventing the explosive growth typical of uncorrected sequence-level importance sampling.
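The closed-form kernel can be recovered from this objective by a standard Lagrangian argument. A brief sketch, writing $W = \pi/\mu$ at the sequence level and omitting normalization constants (the paper's full derivation may handle these details differently):

```latex
\begin{aligned}
\mathcal{L}(Q) &= (1-\alpha)\, D_{\mathrm{KL}}(Q \| \mu)
  + \alpha\, D_{\mathrm{KL}}(Q \| \pi)
  + \lambda\big(\mathbb{E}_Q[W] - C\big) \\[4pt]
\frac{\delta \mathcal{L}}{\delta Q} = 0
  \;\Rightarrow\;
\log Q^*(y) &= (1-\alpha)\log\mu(y) + \alpha\log\pi(y) - \lambda W(y) + \mathrm{const} \\[4pt]
Q^*(y) &\propto \mu(y)\,\Big(\tfrac{\pi(y)}{\mu(y)}\Big)^{\alpha} e^{-\lambda W(y)}
  = \mu(y)\, W(y)^{\alpha}\, e^{-\lambda W(y)}
\end{aligned}
```

The induced reshaping function is then the ratio of proposal to behavior policy, $\phi(W) = Q^*/\mu \propto W^{\alpha} e^{-\lambda W}$.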

The closed-form solution for the optimal proposal yields the reshaping function $\phi(W) = W^{\alpha} \exp(-\lambda W)$, which combines power-law scaling with exponential suppression to smoothly attenuate extreme weights. In practice, the authors adopt a shifted variant $\phi(W) = W^{c_1} \exp(c_2 (1 - W))$ to ensure $\phi(1) = 1$, preserving the scale of on-policy updates. This kernel is applied asymmetrically: distinct hyperparameters $(c_1, c_2)$ are used for positive and negative advantages, allowing stronger suppression of low-weight samples when the advantage is negative, which helps prevent over-penalization of trajectories the policy already disfavors.
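A minimal sketch of the shifted kernel and its asymmetric application. The specific $(c_1, c_2)$ pairs below are illustrative placeholders, not values reported in the paper:

```python
import numpy as np

def reshape_weight(W, c1, c2):
    """Shifted VESPO kernel: phi(W) = W**c1 * exp(c2 * (1 - W)).

    Satisfies phi(1) == 1, so exactly on-policy samples (W = 1)
    keep their original update scale, while large W is suppressed
    by the exponential term.
    """
    return W**c1 * np.exp(c2 * (1.0 - W))

def vespo_weight(W, advantage, pos=(0.5, 0.5), neg=(1.0, 1.0)):
    """Asymmetric application: one (c1, c2) pair for positive
    advantages, another for negative ones (both pairs are
    hypothetical defaults for illustration)."""
    c1, c2 = pos if advantage >= 0 else neg
    return reshape_weight(W, c1, c2)
```

Because $\phi(1) = 1$ by construction, the kernel only reshapes off-policy samples; the exponential factor dominates for large $W$, so extreme weights are attenuated rather than hard-clipped.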

The gradient estimator is implemented in REINFORCE style, with the reshaping weight detached from the computation graph to serve purely as a scaling factor. To ensure numerical stability, all computations, including the sequence-level log-ratio $\log W = \sum_t \big(\log \pi_\theta(y_t \mid x, y_{<t}) - \log \mu(y_t \mid x, y_{<t})\big)$ and the log-transformed reshaping term $c_1 \log W + c_2 (1 - W)$, are performed in log-space, with exponentiation deferred until the final step. This avoids overflow from extreme importance weights while requiring no additional memory beyond storing per-token log-probabilities under both policies.

Experiment

  • VESPO consistently outperforms baselines across multiple model scales and mathematical reasoning benchmarks, particularly excelling on MoE models where train-inference mismatch and policy staleness are amplified.
  • It demonstrates exceptional robustness to policy staleness across varying degrees of off-policy updates (N=4 to 64), maintaining stable training dynamics and high accuracy where other methods degrade or collapse.
  • Under fully asynchronous training, VESPO sustains stable convergence while baselines exhibit instability, reward collapse, or suboptimal performance due to stale rollouts.
  • VESPO tolerates train-inference mismatch without specialized fixes, matching or exceeding the stability of methods augmented with engineering solutions like Routing Replay or Truncated Importance Sampling.
  • Ablations confirm that VESPO’s sequence-level design without length normalization avoids length bias and training collapse, unlike normalized variants.
  • Its asymmetric hyperparameter design for positive and negative advantages balances suppression and learning signal, proving critical for stability and performance.

The authors evaluate VESPO under varying degrees of policy staleness and find it consistently outperforms baselines across all staleness levels, maintaining stable training and strong downstream accuracy even at extreme settings like N=64. Unlike GRPO, GSPO, and SAPO—which degrade or collapse under higher staleness—VESPO’s design avoids length bias and stabilizes updates through asymmetric importance weight suppression. Results confirm VESPO’s robustness to distribution shifts from both policy staleness and train-inference mismatch, particularly benefiting MoE models where such shifts are amplified.

The authors evaluate VESPO against several baselines across multiple model scales and mathematical reasoning benchmarks, finding that VESPO consistently achieves the highest average accuracy in all settings. Results show that VESPO’s performance advantage is most pronounced in larger MoE models, where it mitigates instability caused by distribution shifts from policy staleness and train-inference mismatch. The method’s stability and effectiveness stem from its sequence-level soft suppression of extreme importance weights and asymmetric hyperparameter design, which prevent training collapse and maintain consistent learning dynamics.

