BAPO: Stabilizing Off-Policy Reinforcement Learning for LLMs via Balanced Policy Optimization with Adaptive Clipping

Abstract
Reinforcement learning (RL) has recently become the core paradigm for aligning and strengthening large language models (LLMs). Applying RL in off-policy settings, where stale data from past policies are reused for training, improves sample efficiency, yet it remains challenging: policy entropy declines sharply, and optimization often becomes unstable or even collapses. Through theoretical and empirical analysis, we identify two key insights: (i) an imbalance in optimization, where negative-advantage samples dominate the policy gradient, suppressing useful behaviors and risking gradient explosions; and (ii) the derived Entropy-Clip Rule, which reveals that the fixed clipping mechanism in PPO-like objectives systematically blocks entropy-increasing updates, thereby driving the policy toward over-exploitation at the expense of exploration. Building on these insights, we propose BAlanced Policy Optimization with Adaptive Clipping (BAPO), a simple yet effective method that dynamically adjusts clipping bounds to adaptively re-balance positive and negative contributions, preserve entropy, and stabilize RL optimization. Across diverse off-policy scenarios, including sample replay and partial rollout, BAPO achieves fast, stable, and data-efficient training. On the AIME 2024 and AIME 2025 benchmarks, our 7B BAPO model surpasses open-source counterparts such as SkyWork-OR1-7B, while our 32B BAPO model not only achieves state-of-the-art results among models of the same scale but also outperforms leading proprietary systems such as o3-mini and Gemini-2.5-Flash-Thinking.
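To make the core idea concrete, the sketch below shows a PPO-style clipped surrogate with separate lower and upper clipping bounds and a simple heuristic for adapting them based on the balance of positive- and negative-advantage samples. The function names, the adaptation rule, and the bound values are illustrative assumptions, not the paper's actual BAPO formulation.

```python
# Minimal sketch (PyTorch) of a PPO-like clipped surrogate with asymmetric,
# adaptively chosen clipping bounds, loosely following the abstract's
# description. The bound-adaptation heuristic below is a placeholder and
# may differ from the rule derived in the paper.
import torch


def clipped_surrogate_loss(logp_new, logp_old, advantages,
                           clip_low=0.2, clip_high=0.2):
    """PPO-style loss with independent lower/upper clipping bounds.

    logp_new, logp_old: token log-probabilities under the current and
        behavior policies, shape (batch,).
    advantages: per-sample advantage estimates, shape (batch,).
    clip_low, clip_high: asymmetric clip ranges; in BAPO these are adjusted
        dynamically during training rather than held fixed.
    """
    ratio = torch.exp(logp_new - logp_old)            # importance ratio
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_low, 1.0 + clip_high) * advantages
    # Standard pessimistic (min) surrogate; negated because we minimize.
    return -torch.min(unclipped, clipped).mean()


def adapt_clip_bounds(advantages, base_low=0.2, base_high=0.2,
                      target_pos_fraction=0.5):
    """Placeholder heuristic: widen the bound on whichever side
    (positive- or negative-advantage samples) is under-contributing,
    so both sides contribute to the gradient in a balanced way."""
    pos_fraction = (advantages > 0).float().mean().item()
    if pos_fraction < target_pos_fraction:
        # Positive-advantage samples are scarce: loosen the upper bound so
        # their (often entropy-increasing) updates are clipped less often.
        return base_low, base_high * 1.5
    return base_low * 1.5, base_high
```

In use, one would call adapt_clip_bounds on each batch's advantages and feed the resulting bounds into clipped_surrogate_loss, so the clipping range tracks the positive/negative balance of the off-policy data instead of staying fixed as in vanilla PPO.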