
Near-Future Policy Optimization

Chuanyu Qin, Chenxu Yang, Qingyi Si, Naibin Gu, Dingyu Yao, Zheng Lin, Peng Fu, Nan Duan, Jiaqi Wang

Abstract

Reinforcement learning with verifiable rewards (RLVR) has become a core recipe in post-training. Injecting suitable off-policy trajectories into on-policy exploration accelerates RLVR convergence and raises the performance ceiling, yet sourcing such trajectories remains the key challenge. Existing mixed-policy methods either import trajectories from external teachers (high quality but distributionally distant) or replay earlier training trajectories (close in distribution but of limited quality); neither simultaneously satisfies the two conditions needed to maximize the effective learning signal $S = Q/V$: a source policy that is strong enough (higher $Q$, more new knowledge to learn) and close enough (lower $V$, easier to absorb). We propose Near-Future Policy Optimization (NPO), a simple mixed-policy scheme that learns from the policy's own near-future self: a later checkpoint from the same training run is a natural source of auxiliary trajectories, stronger than the current policy yet closer than any external source, directly balancing trajectory quality against variance cost. We validate NPO through two manual interventions: early-stage bootstrapping and late-stage plateau breakthrough. We further propose AutoNPO, an adaptive variant that automatically triggers interventions based on online training signals and selects the guide checkpoint that maximizes $S$. Applied to Qwen3-VL-8B-Instruct with GRPO, NPO improves the average score from 57.88 to 62.84, while AutoNPO pushes it to 63.15, raising the final performance ceiling while accelerating convergence.

One-sentence Summary

To accelerate reinforcement learning with verifiable rewards, researchers propose Near-Future Policy Optimization (NPO), a mixed-policy scheme that maximizes the effective learning signal by using the model's own later checkpoints as a source of auxiliary trajectories, balancing trajectory quality against variance cost.

Key Contributions

  • The paper introduces Near-Future Policy Optimization (NPO), a mixed-policy scheme that utilizes trajectories from a later checkpoint in the same training run to guide the current policy. This approach provides auxiliary trajectories that are stronger than historical replay while remaining closer to the current distribution than external teacher models, effectively balancing trajectory quality against variance.
  • This work presents AutoNPO, an adaptive variant that automates the timing and selection of guidance by monitoring online training signals such as reward stagnation and entropy decline. The framework automatically triggers interventions and selects the specific guide checkpoint that maximizes an empirical estimate of the effective learning signal.
  • Experimental results demonstrate the effectiveness of NPO through successful manual interventions in early-stage bootstrapping and late-stage plateau breakthrough scenarios. The method shows the ability to bridge mixed-policy RLVR and self-taught approaches by using optimization time as privileged information to improve convergence and performance.

Introduction

Reinforcement learning with verifiable rewards (RLVR) is a critical post-training method for enhancing reasoning capabilities in large language models. While pure on-policy exploration is standard, it often suffers from sparse correct trajectories in early training or hits performance plateaus in later stages. Existing mixed-policy approaches attempt to solve this by either using external teachers, which introduce high distributional variance, or replaying past trajectories, which are limited by the quality of older checkpoints. The authors bridge this gap with a temporal approach, Near-Future Policy Optimization (NPO). By using a later checkpoint from the same training run as a source of auxiliary trajectories, NPO provides a learning signal that is both stronger than the current policy's and closer in distribution than an external model's. This effectively balances signal quality against variance cost, allowing for accelerated convergence and higher performance ceilings.

Method

The authors propose Near-Future Policy Optimization (NPO), a method that enhances reinforcement learning with verifiable rewards (RLVR) by leveraging trajectories from a near-future checkpoint of the same training run to guide the current policy. The core insight is that the effectiveness of off-policy guidance is governed by a trade-off between trajectory quality and variance cost. Trajectory quality $Q(\Delta)$, the fraction of prompts the source policy can solve correctly, increases with the checkpoint distance $\Delta$ from the current policy. In contrast, the variance cost $V(\Delta)$, which arises from importance weighting when combining trajectories from different policies, grows approximately exponentially with $\Delta$. The effective learning signal $S(\Delta) = Q(\Delta)/V(\Delta)$ is therefore unimodal, peaking at an optimal checkpoint distance $\Delta^*$ that balances these competing factors. In the paper's quality-stability trade-off plane, existing approaches such as far-future replay, past-trajectory replay, and external teachers occupy suboptimal regions, while NPO targets the high-$S$ region by using a near-future policy that is both strong and close in distribution to the current policy.
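To make the trade-off concrete, here is a minimal numerical sketch. The functional forms (a saturating pass-rate for $Q(\Delta)$, exponential growth for $V(\Delta)$) and all constants are illustrative assumptions, not values from the paper, but they reproduce the unimodal shape of $S(\Delta)$ with an interior optimum $\Delta^*$:

```python
import numpy as np

# Illustrative model of the quality/variance trade-off. The functional
# forms and constants below are assumptions for visualization only.

def trajectory_quality(delta, q0=0.3, q_max=0.9, rate=0.05):
    """Pass-rate of the source policy: rises with distance, then saturates."""
    return q_max - (q_max - q0) * np.exp(-rate * delta)

def variance_cost(delta, growth=0.02):
    """Importance-weighting variance: grows roughly exponentially with distance."""
    return np.exp(growth * delta)

deltas = np.arange(0, 200)
S = trajectory_quality(deltas) / variance_cost(deltas)  # S(Δ) = Q(Δ)/V(Δ)
delta_star = deltas[np.argmax(S)]
print(f"Effective signal peaks at Δ* = {delta_star} steps (S = {S.max():.3f})")
```

Under these assumptions the signal first rises (quality gains dominate) and then decays (variance dominates), which is exactly why a near-future checkpoint, rather than the farthest or closest available source, maximizes $S$.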

The NPO framework modifies the standard RLVR training loop by introducing a selective guidance mechanism. At a training step $t$ with current policy $\pi^{(t)}$, the method trains an additional $\Delta$ steps to obtain a near-future checkpoint $\pi^{(t+\Delta)}$. It then rolls back to step $t$ and uses $\pi^{(t+\Delta)}$ to supply guidance trajectories for prompts where the current policy struggles. Specifically, for a prompt $x$, the current policy samples a group of $n$ on-policy trajectories $\{o_i\}_{i=1}^n$. The on-policy pass-rate $\hat{p}(x)$ is computed, and if $\hat{p}(x)$ is below a threshold $\tau_{\text{gate}}$ and a verified-correct guidance trajectory $o'_x$ exists in a pre-computed cache, the $n$-th slot of the rollout group is replaced by $o'_x$. The resulting group $\mathcal{G}_{\text{NPO}}(x)$ is used to compute group-relative advantages and update the policy via a clipped objective, with importance-sampling correction applied only to the guidance slot. Because the guidance trajectories are cached offline, this incurs no extra rollout cost during the NPO segment.
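A minimal sketch of this selective guidance gate is shown below. The sampler, verifier, cache, and threshold names are hypothetical stand-ins for the paper's components, and the group-standard-deviation normalization that GRPO typically applies to advantages is omitted for brevity:

```python
# Hypothetical sketch of NPO's selective guidance gate; all names here
# (sample, verify, guidance_cache, tau_gate) are illustrative assumptions.

def build_npo_group(x, sample, verify, guidance_cache, n=8, tau_gate=0.25):
    """Build the rollout group G_NPO(x) for one prompt.

    sample(x)    -> one on-policy trajectory from the current policy
    verify(x, o) -> 1.0 if the trajectory is verifiably correct, else 0.0
    guidance_cache maps prompts to cached verified-correct trajectories o'_x
    """
    group = [sample(x) for _ in range(n)]        # n on-policy rollouts
    rewards = [verify(x, o) for o in group]      # verifiable rewards
    pass_rate = sum(rewards) / n                 # empirical pass-rate p^(x)
    is_guidance = [False] * n
    if pass_rate < tau_gate and x in guidance_cache:
        group[-1] = guidance_cache[x]            # splice o'_x into the n-th slot
        rewards[-1] = 1.0                        # cached trajectory is verified-correct
        is_guidance[-1] = True                   # only this slot gets IS correction
    mean_r = sum(rewards) / n                    # group-relative baseline
    advantages = [r - mean_r for r in rewards]   # advantage = reward - group mean
    return group, advantages, is_guidance
```

The key design point is that the gate only fires on hard prompts (low pass-rate), so most updates remain purely on-policy and the variance introduced by the single guidance slot stays small.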

To validate the approach, the authors introduce two manual interventions: early-stage bootstrapping and late-stage plateau breakthrough. The early intervention uses a short scout segment to generate a near-future checkpoint that guides the initial training phase, accelerating convergence from a cold-start regime. The late intervention uses a checkpoint from beyond a performance plateau to guide a replay of the plateaued segment, enabling the policy to break through its on-policy ceiling. These interventions demonstrate that near-future guidance is beneficial across training stages. Building on this, the authors propose AutoNPO, an adaptive variant that automates the intervention process. AutoNPO maintains a mistake pool of prompts failed by the current policy and uses online signals from the training logs to decide when to intervene and how far to roll back. The intervention trigger combines reward stagnation with entropy decline, while the rollback distance is selected by maximizing an empirical estimate of the effective learning signal $\hat{S}(\Delta) = \hat{Q}(\Delta)/\hat{V}(\Delta)$, where $\hat{Q}(\Delta)$ is the pass-rate of the current policy on prompts failed during the segment starting at $t-\Delta$, and $\hat{V}(\Delta)$ is a variance proxy estimated from the per-token KL divergence between the current and the rollback policy. This adaptive controller reuses existing training signals and adds little overhead, making it a plug-and-play, objective-preserving improvement over standard RLVR.
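The controller logic might look roughly like the following sketch. The window sizes, thresholds, and the pass-rate/KL estimators are assumptions for illustration, not the paper's exact settings:

```python
import numpy as np

# Hypothetical sketch of AutoNPO's trigger and rollback selection; the
# windows, thresholds, and estimator callables are illustrative assumptions.

def should_intervene(rewards, entropies, window=50, eps_r=0.01, eps_h=0.05):
    """Trigger when reward has stagnated AND entropy keeps declining."""
    if len(rewards) < 2 * window:
        return False
    reward_gain = np.mean(rewards[-window:]) - np.mean(rewards[-2*window:-window])
    entropy_drop = np.mean(entropies[-2*window:-window]) - np.mean(entropies[-window:])
    return reward_gain < eps_r and entropy_drop > eps_h

def select_rollback(candidate_deltas, q_hat, v_hat):
    """Pick the rollback distance Δ maximizing S^(Δ) = Q^(Δ)/V^(Δ).

    q_hat(d) -> current-policy pass-rate on prompts failed since step t - d
    v_hat(d) -> variance proxy from per-token KL(current || rollback policy)
    """
    scores = {d: q_hat(d) / max(v_hat(d), 1e-8) for d in candidate_deltas}
    return max(scores, key=scores.get)
```

Reusing logged rewards, entropies, and KL statistics in this way is what keeps the controller cheap: no extra rollouts are needed to decide when or how far to roll back.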

Experiment

The authors evaluate NPO and its variants against several reinforcement learning baselines, including GRPO and replay-based methods, across eight multimodal reasoning benchmarks, using Qwen3-VL-8B-Instruct as the base model. Both the manual and automated NPO variants consistently outperform the baselines: NPO raises the average score from 57.88 to 62.84, and AutoNPO pushes it further to 63.15, achieving the highest average accuracy and leading on multiple individual tasks.

Analysis of the training dynamics shows that NPO's targeted interventions maintain higher policy entropy throughout training, preventing premature policy collapse and supporting a higher late-stage performance ceiling. Because the guidance comes from a near-policy source, exact importance-sampling correction proves unnecessary, simplifying the implementation without sacrificing gains.

