HyperAIHyperAI

Command Palette

Search for a command to run...

EfficientRollout:RLロールアウトのためのシステム認識型自己推測デコーディング

Minseo Kim Minjae Lee Seunghyuk Oh Kevin Galim Donghoon Kim Coleman Hooper Harman Singh Amir Gholami Hyung Il Koo Wonjun Kang

概要

強化学習(RL)は、大規模言語モデル(LLM)における代表的なポストトレーニングパラダイムとなり、強力な推論能力とagents能力を実現している。しかし、ロールアウト生成は依然として主要なレイテンシボトルネックとなっている。これは、自己回帰サンプリングが応答を逐次デコードするためであり、また、少数のロングテール生成が完了時間を決定することが多いためである。投機的デコーディング(SD)は、このボトルネックに対処する自然な方法を提供する。SDは固定されたLLMのサービングにおいて確立された技術であり、tokensを迅速にドラフトし、ターゲットモデルの分布を維持しながら並列検証によってそれらを受け入れることでレイテンシを削減するからである。しかし、その実用的な高速化はRLロールアウトに直接適用できるわけではない。すなわち、(i) 進化していくターゲットポリシーにより、固定されたドラフト生成器はポリシーの出力分布と次第に不一致となり、(ii) ロールアウトデコーディング全体を通じてアクティブバッチサイズが縮小し、並列検証が未利用の計算資源を活用できるcompute-boundからmemory-boundへの移行を引き起こすからである。したがって、RLロールアウトの高速化には、進化していくポリシーからの長期かつ高温の生成においても効果を保つドラフト生成器と、compute-boundを回避するシステム認識型のSDの活用が不可欠である。本研究では、RLロールアウトにおけるこの課題に対処するために設計された、システム認識型の自己SDフレームワークであるEfficientRolloutを提案する。EfficientRolloutは、ターゲットモデルから量子化されたドラフト生成器を誘導(すなわち自己投機的デコーディング)し、個別のドラフト生成器に対する事前学習やオンライン適応を行うことなく、進化していくポリシーと結合した状態を維持する。さらに、システム認識型のSDトグルポリシーと、受容率を考慮したドラフト長適応を連携させることで、有益な動作レジームでのみ推測を

One-sentence Summary

EfficientRollout is a system-aware self-speculative decoding framework that accelerates large language model reinforcement learning rollouts by dynamically adapting drafters to evolving policies and optimizing verification for memory-bound execution, thereby overcoming the latency bottlenecks of autoregressive sampling and fixed-drafter speculative decoding.

Key Contributions

  • This work introduces EfficientRollout, a system-aware self-speculative decoding framework that induces a weight-quantized drafter directly from the evolving target policy at every training step to maintain high block efficiency.
  • The framework incorporates a system-aware speculative decoding toggle that restricts execution to favorable memory-bound regimes and an adaptive draft-length policy that dynamically scales the drafting budget to match real-time token acceptance rates.
  • Empirical evaluations demonstrate that the framework reduces rollout latency by up to 19.6% and end-to-end training latency by up to 12.7% compared to standard reinforcement learning stacks.

Introduction

Reinforcement learning post-training is essential for enhancing large language model reasoning and agentic capabilities, yet rollout generation remains a dominant latency bottleneck because sequential autoregressive decoding struggles to keep pace with the long responses these models produce. While speculative decoding can reduce inference latency, it faces significant hurdles in RL rollouts where the continuously evolving target policy causes fixed drafters to mismatch and shrinking batch sizes shift computation from compute-bound to memory-bound regimes where parallel verification yields diminishing returns. Prior solutions either employ history-based drafting with low block efficiency or rely on auxiliary drafters that necessitate separate pretraining and online adaptation, adding substantial system complexity. The authors propose EfficientRollout, a system-aware self-speculative decoding framework that induces a quantized drafter directly from the evolving target model to maintain alignment without external training overhead. This approach combines a regime-aware toggle that activates speculation only in beneficial conditions with adaptive draft-length control to match the drafting budget to acceptance behavior, achieving up to 19.6% reduction in rollout latency while preserving model quality.

Method

The authors propose EfficientRollout, a system-aware self-speculative decoding (SD) framework designed to accelerate reinforcement learning rollouts. This method addresses three specific challenges inherent to RL training: evolving target policies, shrinking active batch dynamics, and latency bottlenecks in the rollout tail. The framework integrates a target-induced quantized drafter, a dynamic SD toggle policy, and adaptive drafting budgets.

To handle the evolving policy, the method employs a self-SD drafter derived directly from the current target model. This ensures the drafter remains synchronized with the policy without requiring separate pretraining or online adaptation. Specifically, the authors apply lightweight Round-to-Nearest (RTN) quantization to the FFN and QKVO projection layers. This targets the dense projection time, which dominates the decoding latency in the rollout tail, thereby reducing the drafter-to-target latency ratio.

The framework also incorporates a regime-aware SD toggle policy to determine when speculative decoding is beneficial. In the early phases of rollout, the active batch size is large, often saturating compute resources and making SD slower than standard autoregressive decoding. The authors utilize a calibrated roofline model to predict speedup based on the current batch size and sequence length. SD is activated only when the predicted speedup exceeds a safety margin, typically occurring in the "rollout tail" where the batch size shrinks and the system becomes memory-bound.

Furthermore, the system features an adaptive draft-length policy. As RL training progresses, the target policy sharpens, increasing the agreement between the target and the quantized drafter. The method monitors the block efficiency and adjusts the draft length accordingly, increasing the length when acceptance is high and decreasing it when acceptance drops.

As shown in the figure below, the overall architecture integrates these components into a unified pipeline that manages rollout generation, reward calculation, and policy updates.

The detailed decoding process follows a draft-and-verify cycle. The process begins with the drafter generating a sequence of tokens. The target model then verifies these tokens in parallel.

If a token is rejected during verification, the system performs a KV cache rollback to the last accepted state. This ensures that the policy updates are based on valid trajectories while maintaining the efficiency gains from speculative decoding. The authors leverage this mechanism to reduce both rollout and end-to-end latency without altering the underlying RL distribution.

Experiment

3%) even with shorter γˉ\bar{\gamma}γˉ, reflecting limited auxiliary-drafter effectiveness in long, high-temperature RL rollouts (Sec. 5.3).

Training dynamics and quality preservation. EfficientRollout accelerates training while preserving training dynamics. the figure shows that EfficientRollout closely tracks the veRL (AR) reward trajectory on Qwen2.5-7B. This aligns with the lossless nature of SD, which preserves the target distribution [6, 29] and is especially important in RL where rollout-distribution shifts can affect training stability [15, 63]. Full reward and validation-accuracy trajectories across all evaluated models are provided in Sec. F.2.

  1. 5.3 Further Analysis of Rollout Acceleration

Regime-aware SD toggle policy. High τ\tauτ alone is insufficient for realized speedup, and SD must be activated only in system-beneficial regimes via the toggle policy πSD\pi_{\mathrm{SD}}πSD. the figure validates the calibrated roofline boundary of πSD\pi_{\mathrm{SD}}πSD by showing that the predicted speedup correctly separates SD-beneficial and SD-harmful coordinates in the (B,S)(B, S)(B,S) plane. the figure isolates the effect of πSD\pi_{\mathrm{SD}}πSD by comparing rollout latency with the same quantized self-drafter when SD is always enabled versus enabled according to πSD\pi_{\mathrm{SD}}πSD. By disabling SD during only the early 6–11% of decoding steps, the πSD\pi_{\mathrm{SD}}πSD policy yields consistently larger rollout-generation latency reductions relative to AR than always-on SD, despite slightly lower τ\tauτ. These gains therefore cannot be attributed to better draft quality; they come from avoiding verification in the initial large-batch regime,

the figure: Block efficiency across pretrained auxiliary drafters. On DAPO-Math-17K, the evaluated pretrained auxiliary drafters achieve lower block efficiency than quantized self-drafters.

where TV/TpT_{V}/T_{p}TV/Tp is high. Full validation results for πSD\pi_{\mathrm{SD}}πSD are provided in Sec. D.3.

Adaptive draft-length policy. The adaptive γ\gammaγ policy effectively tracks the latency-reducing γ\gammaγ as the evolving target policy shifts over training. the paper isolate this effect by replacing only the adaptive γ\gammaγ policy in EfficientRollout with fixed-γ\gammaγ variants throughout training. As shown in the figure, the adaptive γ\gammaγ policy achieves an overall 19.6% reduction in rollout-generation latency, compared with 13.5% and 11.8% for fixed γ=γlow=5\gamma = \gamma_{\text{low}} = 5γ=γlow=5 and γ=γhigh=11\gamma = \gamma_{\text{high}} = 11γ=γhigh=11, respectively. This improvement comes from leveraging larger latency reductions at different stages of training, as smaller and larger γ\gammaγ values are

more effective early and later in training, respectively. Rather than selecting a single fixed γ\gammaγ through post-hoc sweeps, the adaptive γ\gammaγ policy moves within the pre-specified Γ\GammaΓ using observed τ\tauτ feedback

Learned auxiliary drafting in RL workloads. The lower τ\tauτ and α\alphaα of learned auxiliary drafting in the table reflect the practical difficulty of obtaining an auxiliary drafter aligned with long, high-temperature RL rollouts. Under the prior RL-SD setup of Iso et al. [24], the paper further evaluate auxiliary drafters on DAPO-Math-17K [63], the rollout workload used in that setup. As shown in the figure, the evaluated auxiliary drafters [41, 43] achieve only τ=1.2\tau = 1.2τ=1.22.42.42.4 over the first 30 steps, whereas the quantized self-drafter consistently reaches τ=3.6\tau = 3.6τ=3.63.93.93.9 under the same setting. These results suggest that directly reusing public checkpoints or drafters pretrained with fixed-target LLM-serving recipes [32, 71] is insufficient to reliably obtain high τ\tauτ in RL rollouts. Realizing effective speedups with learned auxiliary drafting may require a more specialized auxiliary-drafter training pipeline [24, 74], such as collecting in-distribution rollouts, training drafters on target-generated synthetic data, and possibly using more aggressive online adaptation. Detailed evaluation settings and analyses with additional auxiliary-drafter checkpoints are presented in Sec. F.6.

The authors validate a roofline-based toggle policy by comparing predicted speedup regions against empirical measurements across different models and draft lengths. The results demonstrate that the predicted boundary effectively distinguishes between configurations where speculative decoding is beneficial and those where it is harmful. As the draft length increases, the operational regime where speculative decoding provides a speedup expands significantly, covering most tested batch sizes and sequence lengths. The predicted speedup boundary accurately separates empirically beneficial and harmful configurations across all evaluated models. Increasing the draft length significantly expands the region where speculative decoding yields speedups, reducing the number of harmful configurations. The toggle policy successfully identifies and avoids high-batch regimes where speculative decoding overhead outweighs computational gains.

The authors analyze the latency breakdown of the EfficientRollout method across three open-source models of varying sizes. The results indicate that both prediction and verification times increase with model scale, while the quantization overhead remains stable. The verification time consistently exceeds prediction time by a similar margin across all configurations, and the relative cost of quantization overhead becomes smaller for larger models. Verification latency is consistently higher than prediction latency across all evaluated models. Quantization overhead remains low and independent of the target model's size. The relative impact of quantization overhead decreases as the target model size increases.

The authors evaluate the block efficiency of weight-quantized self-drafters across varying draft lengths. Results show that block efficiency increases as the draft length grows, with the 8-bit quantized drafter consistently achieving higher efficiency than the 4-bit version. Block efficiency improves as the draft length increases for both drafter variants. The 8-bit quantized drafter outperforms the 4-bit drafter in efficiency metrics. Both drafters maintain strong efficiency at the lowest tested draft length.

The the the table presents block efficiency metrics for auxiliary drafters across sequence length ranges. The data shows that for smaller models (Qwen2.5-7B and Llama3.1-8B), efficiency decreases as sequence length increases, indicating reduced effectiveness. In contrast, the larger model (Qwen2.5-14B) demonstrates increasing efficiency with longer sequences, although overall values remain lower than those of self-drafters. Smaller models exhibit declining block efficiency as sequence length grows. The larger model shows improved block efficiency with longer sequences. The metric values align with the described limited effectiveness of auxiliary drafters.

The experiment demonstrates that EfficientRollout provides the most significant end-to-end training acceleration across all evaluated models compared to standard autoregressive inference and various speculative decoding baselines. It achieves this by substantially reducing rollout generation latency while maintaining high token acceptance rates and preserving the target distribution. EfficientRollout consistently yields the lowest end-to-end step time across all model scales, outperforming history-based, learned auxiliary, and quantized self-drafting methods. The method sustains high acceptance rates and block efficiency, unlike learned auxiliary drafting which shows lower alignment and inconsistent performance. History-based drafting fails to reduce rollout generation latency despite low preparation overhead, indicating that simple prefix reuse is insufficient for speedup in this context.

The evaluation spans multiple model scales, draft lengths, and sequence ranges to validate the toggle policy, latency distribution, drafting efficiency, and end-to-end training acceleration. The roofline-based toggle policy accurately distinguishes beneficial from harmful speculative decoding configurations, with increased draft lengths expanding performance gains while circumventing high-batch overhead. Latency and efficiency breakdowns demonstrate that verification time consistently exceeds prediction time, quantization costs remain negligible regardless of model size, and self-drafters maintain superior efficiency over auxiliary methods across varying conditions. Collectively, these findings confirm that EfficientRollout achieves significant end-to-end training acceleration by reducing rollout latency and preserving target distribution alignment, consistently outperforming standard and speculative decoding baselines.


AIでAIを構築

アイデアからローンチまで — 無料のAIコーディング支援、すぐに使える環境、最高のGPU価格でAI開発を加速。

AI コーディング補助
すぐに使える GPU
最適な料金体系

HyperAI Newsletters

最新情報を購読する
北京時間 毎週月曜日の午前9時 に、その週の最新情報をメールでお届けします
メール配信サービスは MailChimp によって提供されています