BAPO: Stabilizing Off-Policy Reinforcement Learning for LLMs via Balanced Policy Optimization with Adaptive Clipping

Abstract
Reinforcement learning (RL) has recently become the core paradigm for aligning and strengthening large language models (LLMs). Applying RL in off-policy settings, where stale data from past policies are reused for training, improves sample efficiency, yet it remains challenging: policy entropy declines sharply, and optimization often becomes unstable or even collapses. Through theoretical and empirical analysis, we identify two key insights: (i) an imbalance in optimization, where negative-advantage samples dominate the policy gradient, suppressing useful behaviors and risking gradient explosions; and (ii) the derived Entropy-Clip Rule, which reveals that the fixed clipping mechanism in PPO-like objectives systematically blocks entropy-increasing updates, thereby driving the policy toward over-exploitation at the expense of exploration. Building on these insights, we propose BAlanced Policy Optimization with Adaptive Clipping (BAPO), a simple yet effective method that dynamically adjusts clipping bounds to adaptively re-balance positive and negative contributions, preserve entropy, and stabilize RL optimization. Across diverse off-policy scenarios, including sample replay and partial rollout, BAPO achieves fast, stable, and data-efficient training. On the AIME 2024 and AIME 2025 benchmarks, our 7B BAPO model surpasses open-source counterparts such as SkyWork-OR1-7B, while our 32B BAPO model not only achieves state-of-the-art results among models of the same scale but also outperforms leading proprietary systems such as o3-mini and Gemini-2.5-Flash-Thinking.
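To make the core idea concrete, the sketch below shows a PPO-style clipped surrogate with separate lower and upper clipping bounds and a simple heuristic for adapting them based on the balance of positive- and negative-advantage samples. The function names, the adaptation rule, and the bound values are illustrative assumptions, not the paper's actual BAPO formulation.

```python
# Minimal sketch (PyTorch) of a PPO-like clipped surrogate with asymmetric,
# adaptively chosen clipping bounds, loosely following the abstract's
# description. The bound-adaptation heuristic below is a placeholder and
# may differ from the rule derived in the paper.
import torch


def clipped_surrogate_loss(logp_new, logp_old, advantages,
                           clip_low=0.2, clip_high=0.2):
    """PPO-style loss with independent lower/upper clipping bounds.

    logp_new, logp_old: token log-probabilities under the current and
        behavior policies, shape (batch,).
    advantages: per-sample advantage estimates, shape (batch,).
    clip_low, clip_high: asymmetric clip ranges; in BAPO these are adjusted
        dynamically during training rather than held fixed.
    """
    ratio = torch.exp(logp_new - logp_old)            # importance ratio
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_low, 1.0 + clip_high) * advantages
    # Standard pessimistic (min) surrogate; negated because we minimize.
    return -torch.min(unclipped, clipped).mean()


def adapt_clip_bounds(advantages, base_low=0.2, base_high=0.2,
                      target_pos_fraction=0.5):
    """Placeholder heuristic: widen the bound on whichever side
    (positive- or negative-advantage samples) is under-contributing,
    so both sides contribute to the gradient in a balanced way."""
    pos_fraction = (advantages > 0).float().mean().item()
    if pos_fraction < target_pos_fraction:
        # Positive-advantage samples are scarce: loosen the upper bound so
        # their (often entropy-increasing) updates are clipped less often.
        return base_low, base_high * 1.5
    return base_low * 1.5, base_high
```

In use, one would call adapt_clip_bounds on each batch's advantages and feed the resulting bounds into clipped_surrogate_loss, so the clipping range tracks the positive/negative balance of the off-policy data instead of staying fixed as in vanilla PPO.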