HyperAIHyperAI

Command Palette

Search for a command to run...

FIPO: Future-KL 영향을 받는 정책 최적화를 통한 심층 추론 유도

Qwen Pilot Team

초록

본 연구에서는 대규모 언어 모델(LLM)의 추론 병목 현상을 극복하기 위해 설계된 강화 학습 알고리즘인 Future-KL Influenced Policy Optimization (FIPO)를 제안합니다. GRPO 스타일의 학습 방식은 효과적인 확장성을 보여주지만, 일반적으로 궤적(trajectory) 내의 모든 token에 전역적 이점(global advantage)을 균등하게 배분하는 결과 기반 보상(outcome-based rewards, ORM)에 의존합니다. 본 연구에서는 이러한 조립식 신용 할당(coarse-grained credit assignment) 방식이 결정적인 논리적 전환점(logical pivots)과 사소한 token을 구분하지 못함으로써 성능의 한계를 초래한다고 주장합니다.FIPO는 이를 해결하기 위해 정책 업데이트(policy update) 과정에 할인된 미래 KL divergence를 통합하여, 후속 궤적 행동에 미치는 영향력에 따라 token의 가중치를 재조정하는 조밀한 이점 공식화(dense advantage formulation)를 구현합니다. 실험적 결과, FIPO는 표준 베이스라인 모델에서 나타나는 추론 길이 정체 현상을 타파할 수 있음을 입증했습니다. Qwen2.5-32B를 대상으로 평가한 결과, FIPO는 평균 Chain-of-Thought(CoT) 길이를 약 4,000개에서 10,000개 이상의 tokens로 확장시켰으며, AIME 2024 Pass@1 정확도를 50.0%에서 최고 58.0%(약 56.0%에서 수렴)까지 끌어올렸습니다. 이는 DeepSeek-R1-Zero-Math-32B(~ 47.0%)와 o1-mini(~ 56.0%)를 모두 상회하는 성능입니다.본 연구 결과는 조밀한 이점 공식화(dense advantage formulations)를 구축하는 것이 ORM 기반 알고리즘을 발전시켜 베이스 모델의 추론 잠재력을 완전히 끌어내는 핵심 경로임을 시사합니다. 당사는 verl framework를 기반으로 구축된 학습 시스템을 오픈 소스로 공개합니다.

One-sentence Summary

The Qwen Pilot Team presents Future-KL Influenced Policy Optimization (FIPO), a reinforcement learning algorithm incorporating discounted future-KL divergence to establish a dense advantage formulation that re-weights tokens by influence, replacing GRPO's coarse-grained credit assignment, enabling Qwen2.5-32B to extend average chain-of-thought length from roughly 4,000 to over 10,000 tokens and increase AIME 2024 Pass@1 accuracy from 50.0% to 58.0%, thereby outperforming DeepSeek-R1-Zero-Math-32B and o1-mini.

Key Contributions

  • The paper introduces Future-KL Influenced Policy Optimization (FIPO), a reinforcement learning algorithm that incorporates discounted future-KL divergence into policy updates to create a dense advantage formulation. This method re-weights tokens based on their influence on subsequent trajectory behavior to address coarse-grained credit assignment in GRPO-style training.
  • Evaluation on Qwen2.5-32B shows the approach extends average chain-of-thought length from roughly 4,000 to over 10,000 tokens and increases AIME 2024 Pass@1 accuracy to a peak of 58.0%. These results outperform baselines like DeepSeek-R1-Zero-Math-32B and o1-mini while breaking through length stagnation seen in standard methods.
  • The training system is open-sourced on the verl framework to support the evolution of ORM-based algorithms. This work demonstrates that dense advantage formulations can unlock reasoning potential without relying on auxiliary value models or external knowledge priors.

Introduction

Test time scaling strategies using reinforcement learning have become essential for unlocking deep reasoning capabilities in large language models. However, standard GRPO training relies on outcome based rewards that distribute advantage uniformly across all tokens, creating a coarse grained credit assignment problem. This limitation prevents models from identifying critical logical pivots and often causes reasoning trajectories to plateau at intermediate lengths. To address this, the authors introduce Future KL Influenced Policy Optimization, which incorporates discounted future KL divergence into the policy update. This approach creates a dense advantage formulation that reweights tokens based on their influence on subsequent behavior without requiring a critic model. Empirically, this method enables models to break through length stagnation and significantly improve accuracy on complex mathematical benchmarks compared to prior baselines.

Method

FutureKL-Induced Policy Optimization (FIPO) introduces a novel reinforcement learning framework designed to address the coarse-grained credit assignment limitations found in standard Group Relative Policy Optimization (GRPO). The method transforms sparse outcome-based rewards into dense, token-level supervision by incorporating a discounted Future-KL divergence into the policy updates. The core architecture relies on three primary components: probability shift analysis, Future-KL estimation with stability mechanisms, and a re-weighted advantage objective.

The authors begin by establishing the probability shift as the fundamental unit for credit assignment. Instead of treating distributional drift as a regularization cost, the method interprets the log-space difference between the current and old policies as a directional signal of behavioral adjustment. This shift is defined as:

Δlogpt=logπθ(otq,o<t)logπθold(otq,o<t).\Delta \log p _ { t } = \log \pi _ { \theta } ( o _ { t } \mid q , o _ { < t } ) - \log \pi _ { \theta _ { \mathrm { old } } } ( o _ { t } \mid q , o _ { < t } ) .Δlogpt=logπθ(otq,o<t)logπθold(otq,o<t).

A positive shift indicates the policy reinforces a specific reasoning step, while a negative shift suggests suppression. However, relying solely on this instantaneous signal fails to capture long-term consequences. To resolve this, the framework defines Future-KL as the cumulative signed probability shift from the current step to the end of the sequence. This metric quantifies the cumulative deviation of the current policy from the reference policy for the remainder of the trajectory.

FutureKLt=k=tTΔlogpk.\mathrm { F u t u r e K L } _ { t } = \sum _ { k = t } ^ { T } \Delta \log p _ { k } .FutureKLt=k=tTΔlogpk.

Functionally, a positive Future-KL value implies the updated policy reinforces the entire subsequent trajectory, acting as a stable anchor. Conversely, a negative value signals that the trajectory stemming from the current token is becoming less favored. Empirical analysis reveals that unregulated negative signals can lead to severe training instability. As shown in the stability analysis, this collapse is accompanied by a sharp spike in the low-clip fraction and a divergence in Policy KL, indicating that accumulated negative signals can reach extreme values that destabilize the optimization process.

To mitigate this variance, the method refines the Future-KL computation by explicitly masking tokens that exceed the Dual-Clip threshold. This ensures that tokens triggering hard constraints are excluded from the Future-KL computation, preventing gradient explosion. The refined objective incorporates a binary filter MkM_kMk that evaluates to 1 only if the importance ratio remains within the threshold ccc:

FutureKLt=k=tTMkΔlogpk,Mk=I(πθ(oko<t)πold(oko<t)c).\mathrm { F u t u r e K L } _ { t } = \sum _ { k = t } ^ { T } M _ { k } \cdot \Delta \log p _ { k } , \quad M _ { k } = \mathbb { I } \left( \frac { \pi _ { \theta } ( o _ { k } | o _ { < t } ) } { \pi _ { \mathrm { old } } ( o _ { k } | o _ { < t } ) } \leq c \right) .FutureKLt=k=tTMkΔlogpk,Mk=I(πold(oko<t)πθ(oko<t)c).

Beyond stability constraints, the framework addresses the uncertainty of long-horizon generation by introducing a soft decay window. The causal dependency between the current action and future tokens diminishes as the time horizon increases. A discount factor γ(0,1]\gamma \in (0,1]γ(0,1] is incorporated to model this diminishing influence, ensuring credit assignment concentrates on the immediate reasoning chain. The final formulation used in experiments is:

FutureKLt=k=tTMkγktΔlogpk.\mathrm { F u t u r e K L } _ { t } = \sum _ { k = t } ^ { T } M _ { k } \cdot \gamma ^ { k - t } \cdot \Delta \log p _ { k } .FutureKLt=k=tTMkγktΔlogpk.

The decay rate is parameterized as γ=21τ\gamma = 2^{-\frac{1}{\tau}}γ=2τ1, where τ\tauτ controls the effective horizon. This exponential formulation creates a continuous sliding window where τ\tauτ represents the distance at which the future signal's influence attenuates by half, allowing the model to prioritize local coherence while filtering noise from the distant future.

Finally, the method integrates these mechanisms into the policy optimization objective by modulating the standard advantage estimate. The modified advantage A~t\tilde{A}_tA~t is defined using a future influence weight ftf_tft:

ft=clip(exp(FutureKLt),1ϵflow,1+ϵfhigh),A~t=A^tft.f _ { t } = \mathrm { c l i p } \left( \exp ( \mathrm { F u t u r e K L } _ { t } ) , 1 - \epsilon _ { f _ { l o w } } , 1 + \epsilon _ { f _ { h i g h } } \right) , \quad \tilde { A } _ { t } = \hat { A } _ { t } \cdot f _ { t } .ft=clip(exp(FutureKLt),1ϵflow,1+ϵfhigh),A~t=A^tft.

This formulation transforms the accumulated scalar signal from log-space to a multiplicative domain and constrains the coefficient to prevent excessive variance. When the updated policy reinforces the subsequent trajectory, the weighting term magnifies the gradient signal to encourage the current token. Conversely, when the policy suppresses the future trajectory, the term attenuates the update to reduce the reward signal for locally harmful tokens.

The final target loss adopts the token-level formulation from DAPO, maximizing the FIPO objective:

JFIPO(θ)=E(q,a)D,{oi}πθold[1i=1Goii=1Gt=1oimin(ri,tfi,tA^i,t,clip(ri,t,1ϵ,1+ϵ)fi,tA^i,t)].J _ { \mathrm { F I P O } } ( \theta ) = \mathbb { E } _ { ( q , a ) \sim \mathcal { D } , \, \{ o _ { i } \} \sim \pi _ { \theta ^ { \mathrm { o l d } } } } \left[ \frac { 1 } { \sum _ { i = 1 } ^ { G } | o _ { i } | } \sum _ { i = 1 } ^ { G } \sum _ { t = 1 } ^ { | o _ { i } | } \operatorname* { m i n } \left( r _ { i , t } f _ { i , t } \hat { A } _ { i , t } , \, \mathrm { c l i p } \left( r _ { i , t } , 1 - \epsilon , 1 + \epsilon \right) f _ { i , t } \hat { A } _ { i , t } \right) \right] .JFIPO(θ)=E(q,a)D,{oi}πθoldi=1Goi1i=1Gt=1oimin(ri,tfi,tA^i,t,clip(ri,t,1ϵ,1+ϵ)fi,tA^i,t).

Here, GGG represents the number of sampled outputs per query, ri,tr_{i,t}ri,t denotes the importance ratio, and fi,tf_{i,t}fi,t serves as the Future-KL importance weight. This approach enables dense supervision within the efficient GRPO framework, resolving the length-performance plateau observed in existing baselines.

Experiment

Evaluations on AIME benchmarks demonstrate that FIPO improves reliability and reasoning depth over the DAPO baseline. Qualitative analysis indicates that continuous expansion of response length and emergent self-reflection behaviors correlate with accuracy gains and superior optimization stability. Distinct scaling dynamics show larger models benefit from high-entropy exploration while smaller models converge to low-entropy states, confirming the method unlocks latent reasoning capabilities without compromising stability.

The authors evaluate their proposed FIPO method against baselines like GRPO and DAPO on mathematical reasoning benchmarks. The results demonstrate that FIPO consistently achieves higher Pass@1 scores than the competing approaches across both AIME 2024 and AIME 2025 datasets. This indicates a systematic improvement in reasoning reliability over the standard baseline configurations. FIPO achieves superior performance compared to GRPO and DAPO on the AIME 2024 benchmark. The method maintains a leading position over baselines on the AIME 2025 benchmark. Experimental results indicate a systematic improvement in Pass@1 scores over the DAPO baseline.

The the the table presents an ablation study evaluating the impact of influence weight clipping ranges and filtering mechanisms on the FIPO method. Results indicate that adjusting the clipping parameters to a more balanced range significantly boosts performance on the primary benchmark compared to the standard configuration. Additionally, the data confirms that the extreme value filtering mechanism is critical for achieving optimal results, as removing it leads to a noticeable decline in accuracy. A balanced influence weight clipping range yields higher accuracy than the standard configuration on the primary benchmark. The extreme value filtering mechanism is essential for maximizing performance, with the unfiltered version underperforming the filtered one. Performance gains are more pronounced on the AIME 2024 benchmark, while the more challenging AIME 2025 dataset shows consistent results across different settings.

The authors evaluate the FIPO method using different decay rate horizons to assess their impact on mathematical reasoning performance. Results show that the optimal horizon setting varies by benchmark, with the largest horizon performing best on AIME 2024 and a moderately long horizon performing best on AIME 2025. The accompanying analysis notes that while extreme values can boost scores, intermediate horizons often provide better optimization stability. The largest decay rate horizon yields the highest Pass@1 score on the AIME 2024 benchmark. A moderately long decay rate horizon achieves the best performance on the AIME 2025 benchmark. The data indicates that performance sensitivity to the decay rate differs between the two evaluation datasets.

The authors evaluate the proposed FIPO method against the DAPO baseline on the AIME 2024 and AIME 2025 mathematical reasoning benchmarks. Results indicate that FIPO systematically outperforms the baseline across all reported metrics, including average pass rates, consistency, and overall coverage. The most significant gains are observed in average accuracy and consistency, while improvements in the probability of finding at least one correct solution are more modest. FIPO consistently achieves higher average accuracy and consistency scores than the DAPO baseline on both datasets. The proposed method shows a systematic improvement in reliability metrics compared to the baseline configuration. Gains in problem coverage are positive but appear less significant compared to improvements in consistency and average performance.

The authors evaluate the FIPO method against baselines like GRPO and DAPO on AIME 2024 and AIME 2025 benchmarks, demonstrating systematic improvements in reasoning reliability and accuracy. Ablation studies validate that balanced influence weight clipping and extreme value filtering are critical for maximizing performance. Additionally, experiments on decay rate horizons reveal that optimal settings vary by benchmark to ensure optimization stability.


AI로 AI 구축

아이디어에서 출시까지 — 무료 AI 코코딩, 즉시 사용 가능한 환경, 최적의 GPU 가격으로 AI 개발을 가속화하세요.

AI 협업 코딩
바로 사용 가능한 GPU
최적의 가격

HyperAI Newsletters

최신 정보 구독하기
한국 시간 매주 월요일 오전 9시 에 이번 주의 최신 업데이트를 메일로 발송합니다
이메일 서비스 제공: MailChimp