EPO: Entropy-regularized Policy Optimization for LLM Agents Reinforcement Learning
Wujiang Xu, Wentian Zhao, Zhenting Wang, Yu-Jhe Li, Can Jin, Mingyu Jin, Kai Mei, Kun Wan, Dimitris Metaxas

Abstract
Training LLM agents in multi-turn environments with sparse rewards, where completing a single task requires 30+ turns of interaction within an episode, presents a fundamental challenge for reinforcement learning. We identify a critical failure mode unique to this setting: the exploration-exploitation cascade failure. This cascade begins with early-stage policy premature convergence, where sparse feedback causes agents to commit to flawed, low-entropy strategies. Subsequently, agents enter late-stage policy collapse, where conventional entropy regularization becomes counterproductive, promoting chaotic exploration that destabilizes training. We propose Entropy-regularized Policy Optimization (EPO), a general framework that breaks this failure cycle through three synergistic mechanisms: (1) adopting entropy regularization in multi-turn settings to enhance exploration, (2) an entropy smoothing regularizer that bounds policy entropy within historical averages to prevent abrupt fluctuations, and (3) adaptive phase-based weighting that balances exploration and exploitation across training. Our analysis shows that EPO guarantees monotonically decreasing entropy variance while maintaining convergence. EPO achieves up to a 152% performance improvement on ScienceWorld and up to 19.8% on ALFWorld. Our work demonstrates that multi-turn, sparse-reward settings require fundamentally different entropy control than traditional RL, with broad implications for LLM agent training.
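The three mechanisms above suggest a simple shape for the entropy term added to the RL objective. The sketch below is a minimal, illustrative rendering of that idea rather than the paper's exact formulation: the function name, the hyperparameters (alpha_max, alpha_min, beta, window), and the linear annealing schedule are assumptions chosen for clarity.

```python
import torch

def epo_entropy_bonus(logits, entropy_history, step, total_steps,
                      alpha_max=0.01, alpha_min=0.001, beta=0.5, window=10):
    """Illustrative EPO-style entropy term (names and values are assumptions).

    - policy entropy is averaged over tokens in the current batch
    - a smoothing regularizer keeps it close to a running historical average
    - an adaptive, phase-based weight anneals from exploration to exploitation
    """
    # Token-level policy entropy, averaged over the batch.
    probs = torch.softmax(logits, dim=-1)
    log_probs = torch.log_softmax(logits, dim=-1)
    entropy = -(probs * log_probs).sum(-1).mean()

    # Historical average of entropy over the last `window` updates.
    hist_mean = (sum(entropy_history[-window:]) / len(entropy_history[-window:])
                 if entropy_history else entropy.detach())

    # Smoothing regularizer: penalize abrupt deviation from the historical mean.
    smooth_penalty = beta * (entropy - hist_mean) ** 2

    # Phase-based weight: larger early in training (exploration),
    # smaller late in training (exploitation); linear anneal as an assumption.
    progress = step / max(total_steps, 1)
    alpha = alpha_min + (alpha_max - alpha_min) * (1.0 - progress)

    entropy_history.append(entropy.detach().item())
    # Add this term to the policy objective (i.e., subtract from the loss):
    # it rewards entropy but keeps it bounded near its historical average.
    return alpha * entropy - smooth_penalty
```

In use, this term would be added to the policy-gradient objective at each update, with `entropy_history` carried across updates so the smoothing penalty has a history to anchor to.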