HyperAIHyperAI

Command Palette

Search for a command to run...

Agent LLM à mémoire augmentée exploratoire par optimisation hybride sur et hors politique

Zeyuan Liu Jeonghye Kim Xufang Luo Dongsheng Li Yuqing Yang

Résumé

L’exploration demeure le principal goulot d’étranglement pour les agents basés sur les grands modèles linguistiques (LLM) entraînés par apprentissage par renforcement (RL). Bien que les méthodes précédentes exploitent des connaissances pré-entraînées, elles échouent dans des environnements exigeant la découverte d’états nouveaux. Nous proposons EMPO^2 (Exploratory Memory-Augmented On- and Off-Policy Optimization), un cadre hybride d’apprentissage par renforcement qui utilise une mémoire pour favoriser l’exploration et combine des mises à jour en politique et hors politique afin que les LLM performent efficacement avec mémoire, tout en garantissant une robustesse même en son absence. Sur les environnements ScienceWorld et WebShop, EMPO^2 obtient respectivement des améliorations de 128,6 % et 11,3 % par rapport à GRPO. De plus, dans des tests hors distribution, EMPO^2 démontre une supériorité en adaptabilité aux nouvelles tâches, nécessitant uniquement quelques essais avec mémoire et aucune mise à jour de paramètres. Ces résultats mettent en lumière EMPO^2 comme un cadre prometteur pour le développement d’agents basés sur les LLM plus exploratoires et généralisables.

One-sentence Summary

Microsoft Research and KAIST researchers propose EMPO², a hybrid RL framework using memory-augmented exploration and dual-policy updates to enhance LLM agents’ adaptability, outperforming GRPO by 128.6% on ScienceWorld and 11.3% on WebShop while requiring minimal trials for new tasks.

Key Contributions

  • EMPO² addresses the exploration bottleneck in LLM agents by introducing a hybrid RL framework that combines on- and off-policy learning with memory-augmented exploration, enabling agents to discover novel states beyond pretrained knowledge.
  • The method dynamically updates both parametric policy weights and non-parametric memory during training, allowing agents to bootstrap exploration and improve robustness even when memory is unavailable during deployment.
  • Evaluated on ScienceWorld and WebShop, EMPO² achieves 128.6% and 11.3% gains over GRPO, respectively, and demonstrates strong out-of-distribution adaptability with minimal trials and no parameter updates, validating its generalization capability.

Introduction

The authors leverage reinforcement learning to enhance LLM agents’ ability to explore unfamiliar environments, addressing a key gap: most current agents rely too heavily on pretrained knowledge and fail to systematically explore novel states. Prior memory-augmented methods improve short-term adaptation but stagnate due to fixed parameters, while online RL approaches like GRPO lack sufficient exploration to escape local optima. Their main contribution is EMPO², a hybrid on- and off-policy algorithm that jointly updates both model parameters and external memory, enabling continuous exploration, faster convergence, and strong few-shot generalization across tasks like ScienceWorld and WebShop.

Dataset

  • The authors evaluate EMPO² on ScienceWorld and WebShop, two publicly available research benchmarks that contain no private or sensitive data.
  • Both datasets are used as-is, with no modifications or additional data collection; all usage complies with original licenses and community standards.
  • The study is confined to simulated environments to avoid risks associated with deploying online RL systems in safety-critical real-world settings.
  • For real-world applications, the authors emphasize that LLM-generated responses must undergo stricter scrutiny to ensure safety.
  • No dataset-specific preprocessing, cropping, or metadata construction is mentioned; the focus is on benchmark evaluation within existing environments.

Method

The authors leverage a hybrid learning framework called Exploratory Memory-augmented On- and Off-Policy Optimization (EMPO²) to address exploration challenges in online reinforcement learning with large language models. The method integrates both parametric updates—via policy gradient optimization—and non-parametric updates—via an external memory buffer that stores self-generated reflective tips. This dual mechanism enables the agent to learn from past experiences while gradually internalizing guidance into its policy parameters.

At the core of EMPO² is a memory buffer M\mathcal{M}M that stores tips generated by the policy πθ\pi_\thetaπθ itself after each episode. When an episode terminates, the policy is prompted to reflect on the final state and task, producing a tip such as “You focused on the red light bulb but did not complete the task of turning it on.” These tips are stored in memory and later retrieved during rollouts to condition action generation. The retrieval operator Retr(st;M)\text{Retr}(s_t; \mathcal{M})Retr(st;M) selects up to 10 tips most relevant to the current state sts_tst, typically via cosine similarity in an embedding space. This memory-augmented prompting allows the agent to avoid repeated mistakes and explore novel strategies.

Refer to the framework diagram, which illustrates the three learning modes enabled by combining two rollout modes and two update modes. In the first mode, the agent performs prompting without memory, generating actions conditioned only on the current state and task: at+1πθ(st,u)a_{t+1} \sim \pi_\theta(\cdot \mid s_t, u)at+1πθ(st,u). In the second mode, memory-augmented prompting is used: at+1πθ(st,u,tipst)a_{t+1} \sim \pi_\theta(\cdot \mid s_t, u, \text{tips}_t)at+1πθ(st,u,tipst). During the update phase, trajectories from memory-augmented rollouts can be processed in one of two ways: on-policy, where the update uses the same tips as in the rollout, or off-policy, where the tips are removed and the policy is updated based on its unconditioned distribution πθ(st,u)\pi_\theta(\cdot \mid s_t, u)πθ(st,u).

The off-policy update mode is particularly critical for knowledge internalization. It functions as a form of reward-guided knowledge distillation: trajectories sampled under tip-conditioned policies serve as teacher demonstrations, while the student policy πθ(s,u)\pi_\theta(\cdot \mid s, u)πθ(s,u) is updated to reproduce high-advantage actions and suppress low-advantage ones. The importance sampling ratio ρθ\rho_\thetaρθ is adjusted accordingly: for off-policy updates, the log-probability under the tip-conditioned old policy is replaced with the log-probability under the unconditioned current policy. This mismatch enables the base policy to absorb the benefits of tip guidance without requiring tips at inference time.

To stabilize off-policy training, which is prone to gradient explosion, the authors introduce a masking mechanism. Tokens with probability below a threshold δ\deltaδ under πθ(st,u)\pi_\theta(\cdot \mid s_t, u)πθ(st,u) are excluded from the advantage-weighted loss. The modified GRPO loss becomes:

Eup(U){τ(i)}πθold[1NTi=1Nt=1Tmin(ρθ(i,t)A(at(i)),clip ⁣(ρθ(i,t),1ϵ,1+ϵ)A(at(i)))1πθ(at(i)st(i),u)δ]βDKL(πθ(u)πref(u)),\begin{array}{rl} \mathbb{E}_{\substack{u \sim p(\mathcal{U}) \\ \{\tau^{(i)}\} \sim \pi_{\theta_{\mathrm{old}}}}} \Bigg[ \frac{1}{NT} \sum_{i=1}^{N} \sum_{t=1}^{T} \operatorname*{min} \Big( \rho_{\theta}^{(i,t)} A(a_{t}^{(i)}), \, \operatorname{clip} \! \Big( \rho_{\theta}^{(i,t)}, 1 - \epsilon, 1 + \epsilon \Big) A(a_{t}^{(i)} ) \Big) \cdot \mathbf{1}_{\pi_{\theta}(a_{t}^{(i)} | s_{t}^{(i)}, u) \geq \delta} \Bigg] \\ \qquad - \beta D_{\mathrm{KL}} \big( \pi_{\theta}(\cdot | u) \parallel \pi_{\mathrm{ref}}(\cdot | u) \big), \end{array}Eup(U){τ(i)}πθold[NT1i=1Nt=1Tmin(ρθ(i,t)A(at(i)),clip(ρθ(i,t),1ϵ,1+ϵ)A(at(i)))1πθ(at(i)st(i),u)δ]βDKL(πθ(u)πref(u)),

where ρθ(i,t)=πθ(at(i)st(i),u)πθold(at(i)st(i),u,tipst)\rho_{\theta}^{(i,t)} = \frac{\pi_{\theta}(a_{t}^{(i)} | s_{t}^{(i)}, u)}{\pi_{\theta_{\mathrm{old}}}(a_{t}^{(i)} | s_{t}^{(i)}, u, \text{tips}_{t})}ρθ(i,t)=πθold(at(i)st(i),u,tipst)πθ(at(i)st(i),u) for off-policy updates.

To further encourage exploration, EMPO² incorporates an intrinsic reward based on state novelty. A memory list stores distinct states, and for each new state, its cosine similarity with existing entries is computed. If similarity falls below a threshold, the state is added to memory and assigned an intrinsic reward rintrinsic=1nr_{\text{intrinsic}} = \frac{1}{n}rintrinsic=n1, where nnn is the number of similar past states. This mechanism promotes exploration of novel states even in the absence of extrinsic rewards and helps maintain policy entropy.

As shown in the figure below, the agent’s learning trajectory evolves from initial failures to successful task completion, guided by non-parametric memory updates and off-policy internalization. The policy π0\pi_0π0 may fail to locate the red light bulb, but after reflecting and storing tips such as “The red light bulb isn’t here, so you need to find it first,” subsequent policies like π100\pi_{100}π100 leverage these tips to explore more effectively and ultimately succeed.

The overall architecture is implemented as a multi-step rollout with alternating policy updates. At each training iteration, the agent samples tasks, performs rollouts under either memory-augmented or memory-free prompting, generates and stores tips, and then updates the policy using one of the three learning modes. The memory buffer is implemented as a fast, retrievable key-value store with cosine similarity search and a fixed capacity, ensuring efficient access to relevant tips during rollouts.

Experiment

  • EMPO² significantly improves exploration and generalization in LLM agents by combining memory-augmented on-policy and off-policy learning, outperforming baselines like GRPO, Reflexion, and Retrospex on ScienceWorld and WebShop.
  • In ScienceWorld, EMPO² achieves over twice the performance gain of GRPO, mastering complex multi-step tasks and reaching maximum scores in several cases, demonstrating strong long-term planning and hypothesis testing.
  • On WebShop, EMPO² surpasses all baselines including GiGPO, showing superior success rates and scores due to enhanced exploration in web navigation and decision-making.
  • EMPO² adapts rapidly to novel tasks using memory updates, achieving 136% average improvement within 10 steps across diverse domains, while GRPO shows inconsistent or degraded performance.
  • Ablation studies confirm that both on-policy memory and off-policy updates are essential; removing either degrades performance, highlighting their complementary roles in stable and efficient learning.
  • Intrinsic reward is critical for preventing policy collapse and encouraging exploration, though its specific form or scale mainly affects learning speed, not final performance.
  • Memory integration adds moderate computational overhead (~19% rollout time) but delivers substantial gains in learning efficiency, with EMPO² remaining more time-efficient than GRPO overall.

The authors use EMPO² to enhance large language model agents by combining memory-augmented exploration with hybrid on- and off-policy reinforcement learning. Results show that EMPO² consistently outperforms baseline methods across ScienceWorld tasks, achieving significantly higher average returns and solving previously failed tasks to completion. The framework also demonstrates strong generalization, adapting rapidly to new tasks with minimal memory updates and no parameter changes.

The authors use EMPO² to enhance large language model agents by combining memory-augmented exploration with hybrid on- and off-policy reinforcement learning. Results show EMPO² outperforms all baselines in both ScienceWorld and WebShop, achieving higher scores and success rates while demonstrating stronger adaptability to novel tasks with minimal memory updates. The framework’s effectiveness stems from the complementary roles of its components, enabling faster convergence and more robust generalization without requiring parameter updates during adaptation.


Créer de l'IA avec l'IA

De l'idée au lancement — accélérez votre développement IA avec le co-codage IA gratuit, un environnement prêt à l'emploi et le meilleur prix pour les GPU.

Codage assisté par IA
GPU prêts à l’emploi
Tarifs les plus avantageux

HyperAI Newsletters

Abonnez-vous à nos dernières mises à jour
Nous vous enverrons les dernières mises à jour de la semaine dans votre boîte de réception à neuf heures chaque lundi matin
Propulsé par MailChimp