Memory in the Age of AI Agents

Abstract

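Memory has emerged as a core capability of foundation model-based agents and is expected to remain central. As research on agent memory expands rapidly and attracts unprecedented attention, the field has become increasingly fragmented. Existing work that falls under the label of "agent memory" differs substantially in motivation, implementation, and evaluation protocol, and the proliferation of loosely defined memory terminology has further obscured conceptual clarity. Traditional taxonomies such as long/short-term memory do not adequately capture the diversity of contemporary agent memory systems. This work aims to provide an up-to-date, comprehensive picture of current agent memory research. It first delineates the scope of agent memory and distinguishes it from related concepts such as large language model (LLM) memory, retrieval-augmented generation (RAG), and context engineering. It then analyzes agent memory through the unified lenses of forms, functions, and dynamics. From the perspective of forms, it identifies three dominant realizations: token-level memory, parametric memory, and latent memory. From the perspective of functions, it proposes a finer-grained taxonomy of factual memory, experiential memory, and working memory. From the perspective of dynamics, it analyzes how memory is formed, how it evolves, and how it is accessed over time. To support practical development, it provides a comprehensive summary of memory evaluation benchmarks and open-source frameworks. Beyond consolidation, it offers an outlook on emerging research frontiers, including memory automation, reinforcement learning integration, multimodal memory, multi-agent memory, and trustworthiness issues. The authors hope this survey will serve not only as a reference for existing work but also as a conceptual foundation for rethinking memory as a first-class primitive in the design of future agentic intelligence.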

One-sentence Summary

Researchers from National University of Singapore, Renmin University of China, Fudan University, Peking University, and collaborating institutions present a comprehensive "forms-functions-dynamics" taxonomy for agent memory, identifying token-level/parametric/latent memory forms, factual/experiential/working memory functions, and formation/evolution/retrieval dynamics to advance persistent, adaptive capabilities in LLM-based agents beyond traditional long/short-term memory distinctions.

Key Contributions

  • The paper addresses the growing fragmentation in AI agent memory research, where inconsistent terminology and outdated taxonomies like long/short-term memory fail to capture the diversity of modern systems, hindering conceptual clarity and progress.
  • It introduces a unified three-dimensional taxonomy organizing agent memory by forms (token-level, parametric, latent), functions (factual, experiential, working memory), and dynamics (formation, evolution, retrieval), moving beyond coarse temporal categorizations.
  • Supporting this framework, the survey compiles representative benchmarks and open-source memory frameworks while mapping existing systems into the taxonomy through Figure 1, and identifies emerging frontiers like multimodal memory and reinforcement learning integration.

Introduction

The authors highlight that memory has become a cornerstone capability for foundation model-based AI agents, enabling long-horizon reasoning, continual adaptation, and effective interaction with complex environments. As agents evolve beyond static language models into interactive systems for applications like personalized chatbots, recommender systems, and financial investigations, robust memory mechanisms are essential to transform fixed-parameter models into adaptive systems that learn from environmental interactions. Prior work faces significant fragmentation with inconsistent terminology, divergent implementations, and insufficient taxonomies—traditional distinctions like long/short-term memory fail to capture contemporary systems' complexity while overlapping concepts like LLM memory, RAG, and context engineering create conceptual ambiguity. To address these challenges, the authors establish a comprehensive "forms-functions-dynamics" framework that categorizes memory into three architectural forms (token-level, parametric, and latent), three functional roles (factual, experiential, and working memory), and detailed operational dynamics covering memory formation, retrieval, and evolution. This unified taxonomy clarifies conceptual boundaries, reconciles fragmented research, and provides structured analysis of benchmarks, frameworks, and emerging frontiers including reinforcement learning integration, multimodal memory, and trustworthy memory systems.

Dataset

The authors survey two primary categories of evaluation benchmarks for assessing LLM agent memory and long-term capabilities:

  • Memory/Lifelong/Self-Evolving Agent Benchmarks

    • Composition: Explicitly designed for memory retention, lifelong learning, or self-improvement (e.g., MemBench, LoCoMo, LongMemEval).
    • Key details:
      • Focus on factual/experiential memory, multimodal inputs, and simulated/real environments.
      • Sizes range from hundreds to thousands of samples/tasks (e.g., MemBench for user modeling, LongMemEval tracking catastrophic forgetting).
      • Filtering emphasizes controlled memory retention, preference tracking, or multi-episode adaptation.
    • Usage: Evaluated via Table 8, which categorizes benchmarks by memory focus, modality, and scale (e.g., LoCoMo tests preference consistency; LifelongAgentBench measures forward/backward transfer).
  • Other Related Benchmarks

    • Composition: Originally for tool use, embodiment, or reasoning but stress long-horizon memory (e.g., WebShop, ALFWorld, SWE-Bench Verified).
    • Key details:
      • Embodied (ALFWorld), web-based (WebArena), or multi-task (AgentGym) setups.
      • Implicitly test context retention across sequential actions (e.g., WebShop requires recalling prior navigation steps).
      • Scales vary: WebArena uses task-based evaluation; GAIA assesses multi-step research.
    • Usage: Table 9 compares frameworks supporting these benchmarks, noting memory types (factual/experiential), multimodality, and internal structures (e.g., MemoryBank for episodic knowledge consolidation).

The paper treats these benchmarks solely as evaluation resources, not training data, for measuring long-context retention, state tracking, and adaptation. It applies no data processing (e.g., cropping); instead, it compares the benchmarks feature by feature in Tables 8–9, highlighting memory mechanisms like self-reflection (Evo-Memory) or tool-augmented storage (MemoryAgentBench).

Method

The authors describe LLM-based agent memory through a multi-faceted framework that integrates distinct memory forms, functional roles, and dynamic lifecycle processes to enable persistent, adaptive, and goal-directed behavior. The overall architecture is not monolithic but a layered ecosystem in which token-level, parametric, and latent memory coexist and interact, each serving complementary purposes depending on the task’s demands for interpretability, efficiency, or performance.

At the core of the agent loop, each agent $i \in \mathcal{I}$ observes the environment state $s_t$ and receives an observation $o_t^i = O_i(s_t, h_t^i, \mathcal{Q})$, where $h_t^i$ represents the agent’s accessible interaction history and $\mathcal{Q}$ is the fixed task specification. The agent then executes an action $a_t = \pi_i(o_t^i, m_t^i, \mathcal{Q})$, where $m_t^i$ is a memory-derived signal retrieved from the evolving memory state $\mathcal{M}_t \in \mathbb{M}$. This memory state is not a static buffer but a dynamic knowledge base that undergoes continuous formation, evolution, and retrieval, forming a closed-loop cognitive cycle.
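The loop above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: `MemoryState`, `agent_step`, and the word-overlap retrieval rule are hypothetical names and choices made for the example, and the observation function and policy are passed in as plain callables.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


@dataclass
class MemoryState:
    """Hypothetical stand-in for the evolving memory state M_t (a list of text records)."""
    records: List[str] = field(default_factory=list)

    def retrieve(self, observation: str) -> List[str]:
        """Return records sharing at least one word with the observation (the signal m_t^i)."""
        words = set(observation.lower().split())
        return [r for r in self.records if words & set(r.lower().split())]

    def write(self, record: str) -> None:
        """Append a new record; a real system would summarize or consolidate here."""
        self.records.append(record)


def agent_step(
    observe: Callable[[str, list, str], str],
    policy: Callable[[str, List[str], str], str],
    memory: MemoryState,
    state: str,
    history: List[Tuple[str, str]],
    task: str,
) -> str:
    """One pass through the loop: observe, retrieve, act, then write back to memory."""
    o = observe(state, history, task)   # o_t^i = O_i(s_t, h_t^i, Q)
    m = memory.retrieve(o)              # m_t^i, retrieved from M_t
    a = policy(o, m, task)              # a_t = pi_i(o_t^i, m_t^i, Q)
    memory.write(f"{o} -> {a}")         # memory formation (assumed: log the step verbatim)
    history.append((o, a))              # extend the interaction history h_t^i
    return a
```

For example, with an observation function that returns `f"agent sees {state}"` and a policy that answers `"use key"` whenever retrieval returns anything, a memory seeded with `"door is locked"` changes the action taken in state `"door"`, while an empty memory does not.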

The memory system’s architecture is structured around three primary forms, each with distinct representational properties and operational characteristics. Token-level memory, as depicted in the taxonomy, organizes information as explicit, discrete units that can be individually accessed and modified. It is further categorized into flat (1D), planar (2D), and hierarchical (3D) topologies. Flat memory stores information as linear sequences or independent clusters, suitable for simple chunking or dialogue logs. Planar memory introduces explicit relational structures such as graphs or trees within a single layer, enabling richer semantic associations and structured retrieval. Hierarchical memory extends this by organizing information across multiple abstraction layers, supporting coarse-to-fine navigation and cross-layer reasoning, as seen in pyramid or multi-layer architectures.
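To make the three token-level topologies concrete, the sketch below gives one toy data structure per topology. The class names (`FlatMemory`, `PlanarMemory`, `HierarchicalMemory`) and the keyword-matching lookups are assumptions made for illustration; systems covered by the survey typically rely on embeddings and richer graph or tree schemas.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


class FlatMemory:
    """Flat (1D): an ordered log of independent text units."""

    def __init__(self) -> None:
        self.log: List[str] = []

    def add(self, text: str) -> None:
        self.log.append(text)

    def search(self, keyword: str) -> List[str]:
        return [t for t in self.log if keyword.lower() in t.lower()]


class PlanarMemory:
    """Planar (2D): a single-layer graph of (head, relation, tail) triples."""

    def __init__(self) -> None:
        self.edges: Dict[str, List[Tuple[str, str]]] = defaultdict(list)

    def add(self, head: str, relation: str, tail: str) -> None:
        self.edges[head].append((relation, tail))

    def neighbors(self, head: str) -> List[Tuple[str, str]]:
        # .get avoids creating an empty entry for unknown nodes
        return list(self.edges.get(head, []))


class HierarchicalMemory:
    """Hierarchical (3D): topic summaries on top, raw details underneath."""

    def __init__(self) -> None:
        self.summaries: Dict[str, str] = {}
        self.details: Dict[str, List[str]] = defaultdict(list)

    def add(self, topic: str, summary: str, detail: str) -> None:
        self.summaries[topic] = summary
        self.details[topic].append(detail)

    def drill_down(self, keyword: str) -> List[str]:
        """Coarse-to-fine lookup: match summaries first, then return their details."""
        hits: List[str] = []
        for topic, summary in self.summaries.items():
            if keyword.lower() in summary.lower():
                hits.extend(self.details[topic])
        return hits
```

The trade-off shows up in the lookup methods: the flat log scans everything, the graph follows explicit relations from a known node, and the hierarchy filters at the summary layer before touching details.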

Parametric memory, in contrast, stores information directly within the model’s parameters, either by internalizing knowledge into the base weights or by attaching external parameter modules like adapters or LoRA. This form is implicit and abstract, offering performance gains through direct integration into the model’s forward pass but at the cost of slower updates and potential catastrophic forgetting. Latent memory operates within the model’s internal representational space, encoding experiences as continuous embeddings, KV caches, or hidden states. It is human-unreadable but machine-native, enabling efficient, multimodal fusion and low-latency inference, though it sacrifices transparency and editability.

The functional architecture of the memory system is organized around three pillars: factual, experiential, and working memory. Factual memory serves as a persistent declarative knowledge base, ensuring consistency with user preferences and environmental states. Experiential memory encapsulates procedural knowledge, distilling strategies and skills from past trajectories to enable continual learning. Working memory provides a dynamic, bounded workspace for active context management during a single task or session, addressing both single-turn input condensation and multi-turn state maintenance.
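One way to picture the three functional roles is as separate stores with different lifetimes. The sketch below is an illustrative assumption, not a design taken from the survey: `FunctionalMemory` and its method names are hypothetical, factual memory is a dictionary, experiential memory is a list of (task, lesson) pairs, and working memory is a bounded `collections.deque` that is cleared at the end of a session.

```python
from collections import deque
from typing import Any, Dict, List, Tuple


class FunctionalMemory:
    """Toy container separating factual, experiential, and working memory."""

    def __init__(self, working_size: int = 3) -> None:
        self.factual: Dict[str, str] = {}                  # persistent declarative facts
        self.experiential: List[Tuple[str, str]] = []      # lessons distilled from past tasks
        self.working: deque = deque(maxlen=working_size)   # bounded per-session workspace

    def remember_fact(self, key: str, value: str) -> None:
        self.factual[key] = value

    def record_lesson(self, task: str, lesson: str) -> None:
        self.experiential.append((task, lesson))

    def observe(self, turn: str) -> None:
        # deque(maxlen=...) silently drops the oldest turn when full
        self.working.append(turn)

    def end_session(self) -> None:
        """Working memory is session-scoped; the other two stores persist."""
        self.working.clear()

    def context(self, task: str) -> Dict[str, Any]:
        """Assemble what would be handed to the policy for the given task."""
        lessons = [lesson for t, lesson in self.experiential if t == task]
        return {
            "facts": dict(self.factual),
            "lessons": lessons,
            "recent": list(self.working),
        }
```

After a session ends, `context` still returns the stored facts and lessons but an empty `recent` list, which mirrors the distinction between persistent and session-bounded memory described above.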

The operational dynamics of the memory system are governed by a cyclical lifecycle of formation, evolution, and retrieval. Memory formation transforms raw experiences into information-dense knowledge units through semantic summarization, knowledge distillation, structured construction, latent representation, or parametric internalization. Memory evolution then integrates these new units into the existing repository through consolidation, updating, and forgetting mechanisms, ensuring coherence, accuracy, and efficiency. Finally, memory retrieval executes context-aware queries to access relevant knowledge at the right moment, involving timing, query construction, retrieval strategies, and post-processing to deliver concise, coherent context to the LLM policy.
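The formation, evolution, and retrieval cycle can be illustrated with a small store. Everything here is a simplifying assumption for the sake of the example: `LifecycleMemory` is a hypothetical name, formation is reduced to parsing `"key: value"` strings (standing in for summarization or distillation), evolution covers only updating and least-recently-used forgetting, and retrieval is a substring match on keys.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class MemoryUnit:
    key: str
    value: str
    last_used: int


class LifecycleMemory:
    """Toy memory with explicit formation, evolution, and retrieval steps."""

    def __init__(self, capacity: int) -> None:
        self.capacity = capacity
        self.units: Dict[str, MemoryUnit] = {}
        self.clock = 0  # logical time, advanced on every write or read

    def form(self, raw: str) -> MemoryUnit:
        """Formation: turn raw 'key: value' text into a compact unit."""
        key, _, value = raw.partition(":")
        return MemoryUnit(key.strip().lower(), value.strip(), self.clock)

    def evolve(self, unit: MemoryUnit) -> None:
        """Evolution: update an existing key, then forget the least recently used unit."""
        self.clock += 1
        unit.last_used = self.clock
        self.units[unit.key] = unit  # updating: a newer value overwrites the older one
        while len(self.units) > self.capacity:
            oldest = min(self.units.values(), key=lambda u: u.last_used)
            del self.units[oldest.key]  # forgetting

    def retrieve(self, query: str) -> List[str]:
        """Retrieval: return units whose key appears in the query, refreshing their recency."""
        self.clock += 1
        hits = [u for u in self.units.values() if u.key in query.lower()]
        for u in hits:
            u.last_used = self.clock
        return [f"{u.key}: {u.value}" for u in hits]
```

Because retrieval refreshes recency, a unit that keeps being used survives eviction while an untouched one is forgotten first, which is one simple way the three stages interact.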

This entire framework is designed to be flexible and composable. Different agents may instantiate different subsets of these operations at varying temporal frequencies, giving rise to memory systems that range from passive buffers to actively evolving knowledge bases. The authors emphasize that the choice of memory type and mechanism is not arbitrary but reflects the designer’s intent for how the agent should behave in a given task, balancing trade-offs between interpretability, efficiency, and performance. The architecture thus supports a wide spectrum of applications, from multi-turn chatbots and personalized agents to reasoning-intensive tasks and multimodal, low-resource settings.

Experiment

  • Comparative analysis of open-source memory frameworks for LLM agents validates support for factual memory (vector/structured stores) and growing integration of experiential traces (dialogue histories, episodic summaries) and multimodal memory.
  • Frameworks span agent-centric systems with hierarchical memory (e.g., MemGPT, MemoryOS) to general-purpose backends (e.g., Pinecone, Chroma), with many implementing short/long-term separation and graph/profile-based memory spaces.
  • While some frameworks report initial results on memory benchmarks, most focus on providing scalable databases and APIs without standardized agent behavior evaluation protocols.
