
Agentic Reinforced Policy Optimization

Guanting Dong, Hangyu Mao, Kai Ma, Licheng Bao, Yifei Chen, Zhongyuan Wang, Zhongxia Chen, Jiazhen Du, Huiyang Wang, Fuzheng Zhang, Guorui Zhou, Yutao Zhu, Ji-Rong Wen, Zhicheng Dou
Abstract

Large-scale reinforcement learning with verifiable rewards (RLVR) has demonstrated its effectiveness in harnessing the potential of large language models (LLMs) for single-turn reasoning tasks. In realistic reasoning scenarios, LLMs can often utilize external tools to assist in task-solving processes. However, current RL algorithms inadequately balance the models' intrinsic long-horizon reasoning capabilities and their proficiency in multi-turn tool interactions. To bridge this gap, we propose Agentic Reinforced Policy Optimization (ARPO), a novel agentic RL algorithm tailored for training multi-turn LLM-based agents. Through preliminary experiments, we observe that LLMs tend to exhibit highly uncertain behavior, characterized by an increase in the entropy distribution of generated tokens, immediately following interactions with external tools. Motivated by this observation, ARPO incorporates an entropy-based adaptive rollout mechanism, dynamically balancing global trajectory sampling and step-level sampling, thereby promoting exploration at steps with high uncertainty after tool usage. By integrating an advantage attribution estimation, ARPO enables LLMs to internalize advantage differences in stepwise tool-use interactions. Our experiments across 13 challenging benchmarks in computational reasoning, knowledge reasoning, and deep search domains demonstrate ARPO's superiority over trajectory-level RL algorithms. Remarkably, ARPO achieves improved performance using only half of the tool-use budget required by existing methods, offering a scalable solution for aligning LLM-based agents with real-time dynamic environments. Our code and datasets are released at https://github.com/dongguanting/ARPO.
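To make the entropy-based adaptive rollout idea concrete, the sketch below shows one plausible reading of the mechanism: measure next-token entropy right after each tool call and branch extra step-level rollouts only at high-uncertainty steps. This is an illustrative sketch, not the authors' implementation; the function names, thresholds, and branching counts are assumptions, and the paper's actual procedure is in the linked repository.

```python
# Illustrative sketch (not ARPO's actual code): an entropy-gated branching rule
# of the kind the abstract describes. All names and numbers here are assumed
# for exposition only.
import numpy as np

def token_entropy(logits: np.ndarray) -> float:
    """Shannon entropy (in nats) of the next-token distribution given raw logits."""
    logits = logits - logits.max()              # numerical stability
    probs = np.exp(logits) / np.exp(logits).sum()
    return float(-(probs * np.log(probs + 1e-12)).sum())

def rollout_plan(entropies_after_tool: list[float],
                 base_branches: int = 1,
                 extra_branches: int = 4,
                 threshold: float = 1.5) -> list[int]:
    """Decide how many partial rollouts to sample at each post-tool-call step.

    Steps where the model is uncertain after seeing the tool result (entropy
    above the threshold) receive extra step-level samples; low-entropy steps
    keep only the single global trajectory continuation.
    """
    return [base_branches + (extra_branches if h > threshold else 0)
            for h in entropies_after_tool]

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Fake next-token logits recorded right after three tool interactions
    # (smaller logit scale -> flatter distribution -> higher entropy).
    step_logits = [rng.normal(size=10) * scale for scale in (0.5, 3.0, 1.0)]
    entropies = [token_entropy(l) for l in step_logits]
    print("entropies:", [round(h, 2) for h in entropies])
    print("branches per step:", rollout_plan(entropies))
```

Under this reading, compute is concentrated where tool feedback makes the policy most uncertain, which is consistent with the abstract's claim that ARPO matches or exceeds trajectory-level RL while using roughly half the tool-call budget.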