
EPO: Entropy-regularized Policy Optimization for LLM Agents Reinforcement Learning

Wujiang Xu, Wentian Zhao, Zhenting Wang, Yu-Jhe Li, Can Jin, Mingyu Jin, Kai Mei, Kun Wan, Dimitris Metaxas

Abstract

Training LLM agents in multi-turn environments with sparse rewards, where completing a single task requires 30+ turns of interaction within an episode, presents a fundamental challenge for reinforcement learning. We identify a critical failure mode unique to this setting: the exploration-exploitation cascade failure. This cascade begins with early-stage policy premature convergence, where sparse feedback causes agents to commit to flawed, low-entropy strategies. Subsequently, agents enter late-stage policy collapse, where conventional entropy regularization becomes counterproductive, promoting chaotic exploration that destabilizes training. We propose Entropy-regularized Policy Optimization (EPO), a general framework that breaks this failure cycle through three synergistic mechanisms: (1) adopting entropy regularization in multi-turn settings to enhance exploration, (2) an entropy smoothing regularizer that bounds policy entropy within historical averages to prevent abrupt fluctuations, and (3) adaptive phase-based weighting that balances exploration and exploitation across training. Our analysis justifies that EPO guarantees monotonically decreasing entropy variance while maintaining convergence. EPO achieves up to 152% performance improvement on ScienceWorld and up to 19.8% on ALFWorld. Our work demonstrates that multi-turn sparse-reward settings require fundamentally different entropy control than traditional RL, with broad implications for LLM agent training.
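The three mechanisms in the abstract can be illustrated with a minimal sketch of an entropy term for a policy-gradient objective. This is an illustrative reconstruction, not the paper's implementation: the function name `epo_entropy_term`, the linear phase schedule, the window size, and all coefficients (`alpha0`, `beta`) are assumptions chosen for clarity.

```python
import numpy as np

def epo_entropy_term(entropy, history, step, total_steps,
                     alpha0=0.01, beta=0.1, window=50):
    """Illustrative sketch of an EPO-style entropy term (not the
    paper's exact formulation): an exploration bonus with an
    adaptive phase-based weight, minus a smoothing penalty that
    keeps policy entropy near its historical average."""
    # (3) Adaptive phase-based weighting: emphasize exploration
    # early and exploitation late. A linear decay is assumed here;
    # the paper's actual schedule may differ.
    phase_weight = alpha0 * (1.0 - step / total_steps)
    # (1) Entropy regularization: reward higher policy entropy.
    bonus = phase_weight * entropy
    # (2) Entropy smoothing: penalize deviation of the current
    # entropy from the running mean of recent entropies, which
    # discourages the abrupt fluctuations described in the abstract.
    if history:
        hist_mean = float(np.mean(history[-window:]))
        smoothing_penalty = beta * (entropy - hist_mean) ** 2
    else:
        smoothing_penalty = 0.0
    return bonus - smoothing_penalty
```

In a training loop, this term would be added to the policy-gradient loss each update, with the current batch entropy appended to `history` afterward; the quadratic penalty pulls entropy back toward its recent average, so exploration grows or shrinks gradually rather than collapsing or exploding.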
