Command Palette
Search for a command to run...
AUTOMEM: Automatisiertes Erlernen von Gedächtnis als kognitiver Fertigkeit
AUTOMEM: Automatisiertes Erlernen von Gedächtnis als kognitiver Fertigkeit
Shengguang Wu Hao Zhu Yuhui Zhang Xiaohan Wang Serena Yeung-Levy
Zusammenfassung
Gedächtnisexpertise ist eine erlernbare Fertigkeit: zu wissen, was zu enkodieren ist, wann abgerufen werden soll und wie Wissen organisiert wird – eine Fähigkeit, die in der Kognitionswissenschaft als Metagedächtnis bezeichnet wird. Wir übertragen diese Perspektive auf LLMs, indem wir Gedächtnismanagement als trainierbare Fertigkeit behandeln. Wir erheben Dateisystemoperationen zu erstklassigen Gedächtnisaktionen parallel zu Aufgabenaktionen und lassen das Modell selbst entscheiden, wie es sein Gedächtnis verwaltet. Diese Gedächtnisfertigkeit verbessert sich entlang zweier Achsen: der Struktur, die sie unterstützt (Prompts, Dateischemata, Aktionsvokabular), und der Kompetenz des Modells, das sie ausübt. Beide Achsen widersetzen sich manueller Optimierung: Episoden in langfristigen Aufgaben erstrecken sich über Tausende von Schritten, und ein einzelner Gedächtnisfehler kann lange verborgen bleiben, bevor er zutage tritt, was eine menschliche Überprüfung vollständiger Trajektorien unpraktikabel macht. Wir stellen AUTOMEM vor, ein Framework, das beide Achsen automatisiert: Auf der Strukturachse konstruiert es Gedächtnisstrukturen durch evolutionäre Suche über Weltmodell-Zwillinge; auf der Kompetenzachse erstellt es Gedächtnis-Feinabstimmungsdaten, indem es über lange Trajektorien hinweg extrapoliert, welche Aktionen des agenteneigenen Gedächtnisses gute Gedächtnisentscheidungen waren. Dies ermöglicht es dem Agenten, sein Gedächtnis direkt zu trainieren. In drei prozedural generierten Langzeitspielen (Crafter, MiniHack und NetHack) verbesserte die Optimierung allein des Gedächtnisses – ohne Änderung des Aufgabenverhaltens des Modells – die Leistung des Basisagenten um das 2bis 4-Fache und brachte ein offen gewichtetes 32B-Modell auf Wettbewerbsniveau mit Frontier-Systemen wie Claude Opus 4.5 und Gemini 3.1 Pro Thinking. Unsere Ergebnisse zeigen, dass Gedächtnismanagement eine unabhängig erlernbare Fertigkeit ist und ein Ziel mit großer Hebelwirkung darstellt, das große Gewinne bei langfristigen Aufgaben bringt.
One-sentence Summary
Stanford researchers propose AUTOMEM, which treats memory management as a learnable cognitive skill by using two iterative loops to automate memory-structure design and model proficiency, boosting performance by ∼2–4× on the procedurally generated long-horizon games Crafter, MiniHack, and NetHack and making a 32B open-weight model competitive with frontier systems such as Claude Opus 4.5 and Gemini 3.1 Pro Thinking.
Key Contributions
- AUTOMEM is a framework that automates the optimization of an LLM agent's memory skill along two axes: the memory structure (prompts, file schemas, action vocabulary) and the model's proficiency at memory decisions.
- The framework employs a dual-loop design where one loop uses a strong LLM to review complete long-horizon trajectories and iteratively revise the memory scaffold, and a second loop extracts successful memory decisions from many episodes as training signal to sharpen the agent's memory proficiency without altering task-action behavior.
- Across Crafter, MiniHack, and NetHack, optimizing memory alone with AUTOMEM yields 2× to 4× performance gains, lifting a 32B open-weight model to be competitive with Claude Opus 4.5 and Gemini 3.1 Pro Thinking, and demonstrating that memory management is an independently learnable, high-leverage skill.
Introduction
The authors tackle the bottleneck that fixed-size context windows impose on large language models (LLMs) by treating external memory management as an independently trainable skill rather than a fixed architectural module. Prior work typically hard-codes retrieval mechanisms or static buffers into the system, but manually optimizing memory behavior across episodes spanning tens of thousands of steps is intractable for humans. The authors promote file-system operations as first-class actions and introduce AUTOMEM, a framework in which a meta-LLM reviews complete episode traces to automatically improve both the memory scaffold (the prompts, file schema, and validation logic) and the agent’s parametric memory proficiency through targeted finetuning of a dedicated memory specialist, yielding two- to four-fold performance gains on long-horizon game environments.
Method
The authors proposeAUTOMEM, a framework that automates the optimization of memory skills along two axes: structure and proficiency. This is achieved through two sequential outer loops that operate on a shared inner-loop agent.
The inner loop consists of a single LLM agent executing an episode of a long-horizon task. The agent is equipped with a directory of files on disk that serves as its external memory. At each step, the agent runs two routines. The LOG routine determines what information is worth recording about the environment's response to the previous action, such as appending to an existing file, creating a new one, or rewriting an entry. The PLAN routine determines what needs to be recalled to act now by searching across files, reading specific entries, and committing the next world action. By promoting file-system operations as first-class memory actions in the model's action space, memory becomes a learnable skill rather than a fixed mechanism.
The first outer loop optimizes the structure supporting the memory skill. The agent scaffold, which includes the code, prompts, file schema, and action vocabulary, is iteratively revised by a meta-LLM. Because the consequences of memory decisions are often delayed in long-horizon tasks, the optimization signal must be trajectory-level. The meta-LLM is provided with full episode traces, including per-step logs, the resulting memory directories, and the agent code itself. It functions as a code reviewer, identifying points where the scaffold caused failures. For example, the meta-LLM might identify that an unbounded map file is accumulating duplicate entries and respond by introducing a coordinate-keyed deduplication format.
Each iteration is gated on measured improvement. The rewritten agent plays the same fixed seeds as the previous version, and the revision is kept only if the average progression improves. At rough convergence, typically after a few iterations, the scaffold has absorbed what code revision can express.
Once the scaffold is optimized, the model's parametric ability to navigate its memory becomes the remaining bottleneck. The second outer loop addresses this by training the model's memory proficiency. The meta-LLM acts as a training engine, orchestrating the supervised training process end-to-end. It reads the inner-loop agent code and a pool of episode traces to derive selection criteria and produce supervised training data. Every example in the training set is verbatim text produced by the inner-loop model during an episode. The meta-LLM selects which responses to reinforce, acting as a filter on the model's own behavior.
To ensure the training effectively absorbs the curated data, the meta-LLM training engine jointly orchestrates the data selection logic, data composition, and LoRA training configuration as a single decision, refining all three across iterative trials. Because memory is a separable skill, it is trained as a distinct target rather than finetuning on full episodes. During inference, the inner loop runs two model instances sharing a single conversation history. The memory specialist, a LoRA-finetuned copy, handles the LOG routine and the memory-consultation portion of the PLAN routine. The gameplay model, the unmodified base, commits the world action. This separation ensures the training signal stays focused on memory-operation behavior while preserving the base model's competence at producing well-formatted world actions.
Experiment
The experiments evaluate a Qwen2.5-32B-Instruct agent on three procedurally generated long-horizon games, where scaffold optimization that revises prompts and external memory schemas alone more than triples task progression without any model weight changes, outperforming both larger models and standard context-window approaches. Adding a separate memory-proficiency training loop further improves performance, bringing the open-weight agent to a level competitive with frontier proprietary systems. Qualitative analysis reveals that the optimized scaffold reduces wasteful task repetition and redundant memory operations while compressing context, and the trained specialist internalizes a consult-before-write pattern that improves memory use, demonstrating that memory management is a highly effective axis for long-horizon agent behavior.
Starting from a 32B open-weight model with basic context management, optimizing the agent’s memory scaffold doubles or triples progression rates across long-horizon games, and adding a memory-training loop yields further complementary gains that bring the model close to frontier proprietary systems. The improvements derive from structural changes that slash redundant logging by 95% and from training the model to consult existing memory before writing, reducing memory waste. Scaffold optimization alone multiplies progression rate by 1.9× on Crafter, 3.7× on MiniHack, and 3.7× on NetHack. With both scaffold optimization and memory training, the 32B model reaches 51.4% on Crafter, 30.0% on MiniHack, and 1.9% on NetHack, comparable to Claude-Opus-4.5 (49.5%, 27.5%, 2.0%). The evolved memory structure reduces per-step memory growth from 138 to 6 characters (95% reduction), eliminating unbounded append-only log bloat.
Training a memory specialist sharply reduces memory writes per search, instilling a consult-before-write discipline. The specialist searches existing files before appending new content, cutting the write-to-search ratio by more than half in Crafter and by over 70% in MiniHack and NetHack. This internalized pattern complements the optimized scaffold's structured memory, leading to more efficient and goal-directed agent behavior. The write-to-search ratio drops from 0.84 to 0.39 in Crafter, 2.89 to 0.82 in MiniHack, and 4.66 to 1.31 in NetHack after training the specialist. The specialist consistently searches memory before writing, avoiding blind appends and reducing redundant logging by up to 72%.
Optimizing the agent’s memory scaffold and training a memory specialist to consult memory before writing dramatically improve performance on long-horizon tasks, bringing a 32B open-weight model close to frontier proprietary systems. Restructuring memory alone eliminates 95% of per-step log bloat and boosts progression rates, while memory training cuts redundant writes by more than half by instilling a disciplined consult-before-write pattern.