Near-Future Policy Optimization
Chuanyu Qin, Chenxu Yang, Qingyi Si, Naibin Gu, Dingyu Yao, Zheng Lin, Peng Fu, Nan Duan, Jiaqi Wang
Abstract
Reinforcement learning with verifiable rewards (RLVR) has become a core post-training recipe. Introducing suitable off-policy trajectories into on-policy exploration accelerates RLVR convergence and raises the performance ceiling, yet finding a source of such trajectories remains the key challenge. Existing mixed-policy methods either import trajectories from external teachers (high-quality but distributionally far) or replay past training trajectories (close but capped in quality), and neither simultaneously satisfies the strong enough (higher Q: more new knowledge to learn) and close enough (lower V: more readily absorbed) conditions required to maximize the effective learning signal S = Q/V. We propose Near-Future Policy Optimization (NPO), a simple mixed-policy scheme that learns from a policy's own near-future self: a later checkpoint from the same training run is a natural source of auxiliary trajectories that is both stronger than the current policy and closer than any external source, directly balancing trajectory quality against variance cost. We validate NPO through two manual interventions, early-stage bootstrapping and late-stage plateau breakthrough, and further propose AutoNPO, an adaptive variant that automatically triggers interventions from online training signals and selects the guide checkpoint that maximizes S. On Qwen3-VL-8B-Instruct with GRPO, NPO improves average performance from 57.88 to 62.84, and AutoNPO pushes it to 63.15, raising the final performance ceiling while accelerating convergence.
One-sentence Summary
To accelerate reinforcement learning with verifiable rewards, researchers propose Near-Future Policy Optimization (NPO), a mixed-policy scheme that maximizes the effective learning signal by utilizing a model's own later checkpoints as auxiliary trajectories to balance trajectory quality against variance cost.
Key Contributions
- The paper introduces Near-Future Policy Optimization (NPO), a mixed-policy scheme that utilizes trajectories from a later checkpoint in the same training run to guide the current policy. This approach provides auxiliary trajectories that are stronger than historical replay while remaining closer to the current distribution than external teacher models, effectively balancing trajectory quality against variance.
- This work presents AutoNPO, an adaptive variant that automates the timing and selection of guidance by monitoring online training signals such as reward stagnation and entropy decline. The framework automatically triggers interventions and selects the specific guide checkpoint that maximizes an empirical estimate of the effective learning signal.
- Experimental results demonstrate the effectiveness of NPO through successful manual interventions in early-stage bootstrapping and late-stage plateau breakthrough scenarios. The method shows the ability to bridge mixed-policy RLVR and self-taught approaches by using optimization time as privileged information to improve convergence and performance.
Introduction
Reinforcement learning with verifiable rewards (RLVR) is a critical post-training method for enhancing reasoning capabilities in large language models. While pure on-policy exploration is standard, it often suffers from sparse correct trajectories in early training or hits performance plateaus in later stages. Existing mixed-policy approaches attempt to solve this by either using external teachers, which introduce high distributional variance, or replaying past trajectories, which are limited by the quality of older checkpoints. The authors propose a temporal approach, Near-Future Policy Optimization (NPO), to bridge this gap. By using a later checkpoint from the same training run as a source of auxiliary trajectories, NPO provides a learning signal that is both stronger than the current policy and closer in distribution than external models. This effectively balances signal quality against variance cost, allowing for accelerated convergence and higher performance ceilings.
Method
The authors propose Near-Future Policy Optimization (NPO), a method that enhances reinforcement learning with verifiable rewards (RLVR) by leveraging trajectories from a near-future checkpoint of the same training run to guide the current policy. The core insight is that the effectiveness of off-policy guidance is governed by a trade-off between trajectory quality and variance cost. Trajectory quality Q(Δ), which measures the fraction of prompts the source policy can solve correctly, increases with the checkpoint distance Δ from the current policy. In contrast, variance cost V(Δ), which arises from importance weighting when combining trajectories from different policies, grows approximately exponentially with Δ. The effective learning signal S(Δ) = Q(Δ)/V(Δ) thus exhibits an inverted-U shape, peaking at an optimal checkpoint distance Δ* that balances these competing factors. As shown in the figure below, existing approaches such as far-future replay, past-trajectory replay, and external teachers occupy suboptimal regions of the quality-stability trade-off plane, while NPO targets the high-S region by using a near-future policy that is both strong and close in distribution to the current policy.
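To make the trade-off concrete, here is a minimal numerical sketch. The functional forms below are illustrative assumptions (a saturating Q and an exponential V), not the paper's measured curves; they only show why S(Δ) = Q(Δ)/V(Δ) peaks at an intermediate distance.

```python
import numpy as np

# Illustrative sketch (assumed functional forms, not the paper's data):
# quality saturates with checkpoint distance, variance grows exponentially,
# so the effective signal S(Δ) = Q(Δ)/V(Δ) peaks at an intermediate Δ*.
deltas = np.arange(1, 51)            # candidate checkpoint distances (training steps)
Q = 1.0 - np.exp(-deltas / 15.0)     # assumed saturating trajectory quality Q(Δ)
V = np.exp(deltas / 10.0)            # assumed exponential variance cost V(Δ)
S = Q / V                            # effective learning signal S(Δ)

delta_star = deltas[np.argmax(S)]    # distance balancing quality against variance
print(f"optimal checkpoint distance Δ* = {delta_star}")
```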
The NPO framework modifies the standard RLVR training loop by introducing a selective guidance mechanism. At a training step t with current policy π(t), the method trains an additional Δ steps to obtain a near-future checkpoint π(t+Δ). It then rolls back to step t and uses π(t+Δ) to supply guidance trajectories for prompts where the current policy struggles. Specifically, for a prompt x, the current policy samples a group of n on-policy trajectories {o_1, …, o_n}. The on-policy pass-rate p̂(x) is computed, and if p̂(x) is below a threshold τ_gate and a verified-correct guidance trajectory o′_x exists in a pre-computed cache, the n-th slot of the rollout group is replaced by o′_x. The resulting group G_NPO(x) is used to compute group-relative advantages and update the policy via a clipped objective, with importance sampling correction applied only to the guidance slot. This process is implemented in a way that incurs no extra rollout cost during the NPO segment, as the guidance trajectories are cached offline.
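The gating and slot-replacement logic can be sketched as follows. This is a hedged, simplified reconstruction from the description above, not the authors' implementation: `policy_sample`, `verify`, `guidance_cache`, and the default values of `n` and `tau_gate` are all hypothetical placeholders.

```python
import numpy as np

def build_npo_group(prompt, policy_sample, verify, guidance_cache,
                    n=8, tau_gate=0.25):
    """Minimal sketch of NPO's selective guidance (names and defaults are
    illustrative). Samples n on-policy rollouts; if the on-policy pass-rate
    falls below tau_gate and a verified-correct cached trajectory exists,
    the n-th slot is replaced by the near-future guidance trajectory."""
    group = [policy_sample(prompt) for _ in range(n)]       # on-policy rollouts
    rewards = np.array([verify(prompt, o) for o in group],  # verifiable 0/1 rewards
                       dtype=float)
    pass_rate = rewards.mean()                              # p̂(x)

    guided_slot = None
    if pass_rate < tau_gate and prompt in guidance_cache:
        group[-1] = guidance_cache[prompt]   # inject cached near-future trajectory
        rewards[-1] = 1.0                    # cache holds verified-correct rollouts
        guided_slot = n - 1                  # IS correction applies to this slot only

    # Group-relative advantages, as in GRPO-style objectives.
    adv = (rewards - rewards.mean()) / (rewards.std() + 1e-6)
    return group, adv, guided_slot
```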
To validate the approach, the authors introduce two manual interventions: early-stage bootstrapping and late-stage plateau breakthrough. The early intervention uses a short scout segment to generate a near-future checkpoint that guides the initial training phase, accelerating convergence from a cold-start regime. The late intervention uses a checkpoint from beyond a performance plateau to guide the replay of the plateaued segment, enabling the policy to break through its on-policy ceiling. These interventions demonstrate that near-future guidance is beneficial across different training stages. Building on this, the authors propose AutoNPO, an adaptive variant that automates the intervention process. AutoNPO maintains a mistake pool of prompts failed by the current policy and uses online signals from training logs to determine when to intervene and how far to roll back. The intervention trigger is based on a combination of reward stagnation and entropy drop, while the rollback distance is selected by maximizing an empirical estimate of the effective learning signal Ŝ(Δ) = Q̂(Δ)/V̂(Δ), where Q̂(Δ) is the pass-rate of the current policy on prompts failed during the segment starting at t−Δ, and V̂(Δ) is a variance proxy estimated from the per-token KL divergence between the current and the rollback policy. This adaptive controller reuses existing training signals and avoids substantial overhead, enabling a plug-and-play, objective-preserving improvement over standard RLVR.
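A minimal sketch of the AutoNPO controller, under stated assumptions: the trigger thresholds (`window`, `reward_eps`, `entropy_drop`) are hypothetical, and using exp(KL) as the variance proxy V̂(Δ) is one plausible reading of "estimated from the per-token KL divergence", not the paper's exact estimator.

```python
import numpy as np

def should_intervene(rewards, entropies, window=50,
                     reward_eps=1e-3, entropy_drop=0.1):
    """Hypothetical AutoNPO trigger: intervene when reward has stagnated over
    the recent window while entropy keeps falling. Assumes at least 2*window
    logged steps; all thresholds here are illustrative."""
    stagnant = abs(np.mean(rewards[-window:])
                   - np.mean(rewards[-2 * window:-window])) < reward_eps
    collapsing = entropies[-1] < entropies[-window] - entropy_drop
    return stagnant and collapsing

def select_rollback(candidate_deltas, pass_rate_on_failed, kl_to_rollback):
    """Pick the rollback distance maximizing the empirical signal
    Ŝ(Δ) = Q̂(Δ)/V̂(Δ). Q̂ is the pass-rate on previously failed prompts;
    V̂ is approximated here as exp(per-token KL) — an assumption."""
    S_hat = {d: pass_rate_on_failed(d) / max(np.exp(kl_to_rollback(d)), 1e-8)
             for d in candidate_deltas}
    return max(S_hat, key=S_hat.get)
```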
Experiment
The authors evaluate NPO and its variants against several reinforcement learning baselines, including GRPO and replay-based methods, on eight multimodal reasoning benchmarks, using Qwen3-VL-8B-Instruct as the base model. Both manual and automated NPO variants consistently outperform the baselines, with AutoNPO achieving the highest average accuracy and leading on multiple individual tasks.
The gains come from targeted trajectory interventions that improve training dynamics: NPO maintains higher policy entropy throughout training, preventing premature policy collapse, preserving exploration, and supporting a higher late-stage performance ceiling. Strategically injecting high-quality guidance trajectories also enables the policy to break through performance plateaus. Finally, the results show that exact importance-sampling correction is unnecessary for NPO: because the guidance policy stays close to the current policy, the method can be simplified without sacrificing performance.