EPO: Entropy-regularized Policy Optimization for LLM Agents Reinforcement Learning
Wujiang Xu, Wentian Zhao, Zhenting Wang, Yu-Jhe Li, Can Jin, Mingyu Jin, Kai Mei, Kun Wan, Dimitris Metaxas

Abstract
Training LLM agents in multi-turn environments with sparse rewards, where completing a single task requires 30+ turns of interaction within an episode, presents a fundamental challenge for reinforcement learning. We identify a critical failure mode unique to this setting: the exploration-exploitation cascade failure. This cascade begins with early-stage policy premature convergence, where sparse feedback causes agents to commit to flawed, low-entropy strategies. Subsequently, agents enter late-stage policy collapse, where conventional entropy regularization becomes counterproductive, promoting chaotic exploration that destabilizes training. We propose Entropy-regularized Policy Optimization (EPO), a general framework that breaks this failure cycle through three synergistic mechanisms: (1) adopting entropy regularization in multi-turn settings to enhance exploration, (2) an entropy smoothing regularizer that bounds policy entropy within historical averages to prevent abrupt fluctuations, and (3) adaptive phase-based weighting that balances exploration and exploitation across training. Our analysis shows that EPO guarantees monotonically decreasing entropy variance while maintaining convergence. EPO achieves up to a 152% performance improvement on ScienceWorld and up to 19.8% on ALFWorld. Our work demonstrates that multi-turn, sparse-reward settings require fundamentally different entropy control than traditional RL, with broad implications for LLM agent training.
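The three mechanisms above suggest a simple shape for the entropy term added to the RL objective. The sketch below is a minimal, illustrative rendering of that idea rather than the paper's exact formulation: the function name, the hyperparameters (alpha_max, alpha_min, beta, window), and the linear annealing schedule are assumptions chosen for clarity.

```python
import torch

def epo_entropy_bonus(logits, entropy_history, step, total_steps,
                      alpha_max=0.01, alpha_min=0.001, beta=0.5, window=10):
    """Illustrative EPO-style entropy term (names and values are assumptions).

    - policy entropy is averaged over tokens in the current batch
    - a smoothing regularizer keeps it close to a running historical average
    - an adaptive, phase-based weight anneals from exploration to exploitation
    """
    # Token-level policy entropy, averaged over the batch.
    probs = torch.softmax(logits, dim=-1)
    log_probs = torch.log_softmax(logits, dim=-1)
    entropy = -(probs * log_probs).sum(-1).mean()

    # Historical average of entropy over the last `window` updates.
    hist_mean = (sum(entropy_history[-window:]) / len(entropy_history[-window:])
                 if entropy_history else entropy.detach())

    # Smoothing regularizer: penalize abrupt deviation from the historical mean.
    smooth_penalty = beta * (entropy - hist_mean) ** 2

    # Phase-based weight: larger early in training (exploration),
    # smaller late in training (exploitation); linear anneal as an assumption.
    progress = step / max(total_steps, 1)
    alpha = alpha_min + (alpha_max - alpha_min) * (1.0 - progress)

    entropy_history.append(entropy.detach().item())
    # Add this term to the policy objective (i.e., subtract from the loss):
    # it rewards entropy but keeps it bounded near its historical average.
    return alpha * entropy - smooth_penalty
```

In use, this term would be added to the policy-gradient objective at each update, with `entropy_history` carried across updates so the smoothing penalty has a history to anchor to.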