Active Context Compression: Autonomous Memory Management in LLM Agents

Nikhil Verma

Abstract

Large Language Model (LLM) agents struggle with long-horizon software engineering tasks because of "Context Bloat." As the interaction history grows, computational cost explodes, latency increases, and reasoning degrades because the model is distracted by irrelevant past errors. Existing solutions have mostly relied on passive, external summarization mechanisms that the agent cannot control.

In this paper, we propose Focus, an agent-centric architecture inspired by the biological exploration strategies of the slime mold Physarum polycephalum. The Focus agent autonomously decides when to consolidate key learnings into a persistent "Knowledge" block, and it actively prunes the raw interaction history. We built an optimized scaffold that reflects industry best practices (persistent bash + string-replacement editor) and evaluated Focus with Claude Haiku 4.5 on five context-intensive instances (N=5) from SWE-bench Lite.

With aggressive prompting that encourages frequent compression, Focus achieved a 22.7% token reduction (14.9M → 11.5M tokens) while keeping accuracy identical (3/5 = 60% for both agents). Focus performed an average of 6.0 autonomous compressions per task, and token savings reached up to 57% on individual instances. This work demonstrates that capable models can autonomously self-regulate their context when given appropriate tools and prompting, which points to a path toward cost-efficient agentic systems that do not sacrifice task performance.

One-sentence Summary

Nikhil Verma proposes Focus, an agent-centric architecture inspired by the biological exploration strategies of Physarum polycephalum that enables LLM agents to autonomously manage context through active consolidation and pruning, achieving a 22.7% token reduction on SWE-bench Lite instances using Claude Haiku 4.5 while maintaining 60% accuracy.

Key Contributions

  • The paper introduces Focus, an agent-centric architecture inspired by the biological exploration strategies of slime mold that enables intra-trajectory context management. This method allows an agent to autonomously decide when to consolidate key learnings into a persistent knowledge block while actively pruning raw interaction history.
  • This work implements an optimized scaffold using industry best practices, such as a persistent bash interface and a string-replacement editor, to facilitate autonomous self-regulation of context. This approach enables the agent to summarize recent trajectories into high-level insights and physically delete redundant logs during a single task.
  • Evaluations on context-intensive SWE-bench Lite instances demonstrate that the Focus agent achieves a 22.7% total token reduction while maintaining identical task accuracy compared to standard agents. Results show that the method performs an average of 6.0 autonomous compressions per task, with individual instance token savings reaching up to 57%.

Introduction

As Large Language Model (LLM) agents tackle complex, long-horizon software engineering tasks, they run into a problem known as context bloat. Naively filling a large context window leads to quadratic cost growth, higher latency, and context poisoning, where irrelevant trial-and-error logs distract the model from the primary task. Existing solutions use external memory hierarchies or separate compression models, but they often rely on passive mechanisms that the agent cannot control during a continuous task trajectory. The authors propose Focus, an agent-centric architecture inspired by the biological exploration strategies of slime mold. This approach lets the agent perform intra-trajectory compression: it autonomously summarizes key learnings into a persistent knowledge block and actively prunes raw interaction history to keep the context lean and effective.
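The quadratic cost growth can be illustrated with a small sketch. It assumes a simplified billing model that is not taken from the paper: every step appends a fixed number of tokens, and the entire history is re-sent as input at each step. The function name and the per-step token count are hypothetical.

```python
def cumulative_input_tokens(steps: int, tokens_per_step: int) -> int:
    """Total input tokens when the full, append-only history is re-sent each step.

    Assumed model (illustrative only): each step adds `tokens_per_step`
    tokens to the context, and the whole context is billed again.
    """
    total = 0
    context = 0
    for _ in range(steps):
        context += tokens_per_step  # history grows by one step's worth
        total += context            # the whole context is sent again
    return total


# Doubling the number of steps roughly quadruples the cumulative cost.
short_run = cumulative_input_tokens(50, 1_000)
long_run = cumulative_input_tokens(100, 1_000)
print(short_run, long_run, round(long_run / short_run, 2))
```

Under these assumptions the total equals tokens_per_step × n(n+1)/2 for n steps, which is why an append-only history becomes expensive on long trajectories.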

Method

The authors introduce a novel architecture termed the Focus Loop, which augments the standard ReAct agent loop with two specialized primitives: start_focus and complete_focus. Unlike traditional approaches that rely on external timers or fixed heuristics to manage context length, this architecture grants the agent full autonomy to determine when to initiate and conclude a focus cycle.

The process begins when the agent invokes start_focus to declare a specific investigation objective, such as debugging a database connection. This action establishes a formal checkpoint within the conversation history. Following this checkpoint, the agent enters the exploration phase, where it utilizes standard tools like reading, editing, and executing code to perform its tasks.

As the agent reaches a natural conclusion to a sub-task or encounters a dead end, it invokes complete_focus. During this consolidation phase, the agent generates a structured summary that captures the attempted actions, the facts or bugs learned, and the final outcome. The system then executes a withdrawal process: the generated summary is appended to a persistent "Knowledge" block located at the top of the context, and all intermediate messages between the initial checkpoint and the current step are deleted. This mechanism transforms the context from a monotonically increasing log into a "Sawtooth" pattern, where the context expands during exploration and collapses during consolidation. This allows the model to manage its own context based on the inherent structure of the task rather than arbitrary step counts.
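The checkpoint, consolidation, and withdrawal steps described above can be sketched as a toy context manager. This is a minimal illustration and not the authors' implementation: only the names start_focus and complete_focus come from the summary, while the FocusContext class, its fields, and the message format are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class FocusContext:
    """Toy model of the Focus loop; everything except the two primitive names is assumed."""

    knowledge: List[str] = field(default_factory=list)  # persistent "Knowledge" block
    messages: List[str] = field(default_factory=list)   # raw interaction history
    checkpoint: Optional[int] = None                     # index where the focus began
    objective: Optional[str] = None

    def start_focus(self, objective: str) -> None:
        """Declare an investigation objective and mark a checkpoint."""
        self.objective = objective
        self.checkpoint = len(self.messages)

    def add_message(self, message: str) -> None:
        """Record one exploration step (tool call, tool output, and so on)."""
        self.messages.append(message)

    def complete_focus(self, summary: str) -> int:
        """Append the summary to Knowledge and delete messages since the checkpoint."""
        if self.checkpoint is None:
            raise RuntimeError("complete_focus called without start_focus")
        dropped = len(self.messages) - self.checkpoint
        del self.messages[self.checkpoint:]
        self.knowledge.append(f"[{self.objective}] {summary}")
        self.checkpoint = None
        self.objective = None
        return dropped


ctx = FocusContext()
ctx.add_message("user: fix the failing login test")
ctx.start_focus("debug database connection")
for i in range(3):
    ctx.add_message(f"tool output {i}")
size_before = len(ctx.messages)
dropped = ctx.complete_focus("Wrong port in the connection string; corrected it.")
size_after = len(ctx.messages)
print(size_before, dropped, size_after)
```

The message count rises during exploration and falls back to the checkpoint on consolidation, which is the sawtooth shape in miniature. Messages recorded before the checkpoint are left untouched.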

To support this loop, the authors implement an optimized scaffold designed for software engineering tasks. This scaffold consists of two primary tools: a Persistent Bash session and a String-Replace Editor. The Persistent Bash tool provides a stateful shell environment where the working directory and environment remain consistent across multiple calls, mimicking a real-world developer terminal. To ensure precise file manipulation, the String-Replace Editor allows for targeted edits through exact string replacement, including operations such as viewing, creating, replacing, and inserting text. This approach avoids the common errors associated with full-file rewrites. The agent is guided by a system prompt to utilize these tools extensively, with a maximum limit of 150 steps, and is encouraged to implement tests before attempting to solve the primary problem.
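To make the idea of exact string replacement concrete, here is a minimal sketch of a replace operation. The summary does not give the editor's interface, so the function name, its signature, and the rule that the target string must match exactly once are all assumptions made for illustration.

```python
def str_replace(text: str, old: str, new: str) -> str:
    """Replace exactly one occurrence of `old` in `text`.

    Assumed policy (not confirmed by the source): reject empty, missing,
    or ambiguous targets so that an edit never lands in the wrong place.
    """
    if not old:
        raise ValueError("old string must not be empty")
    count = text.count(old)
    if count == 0:
        raise ValueError("old string not found")
    if count > 1:
        raise ValueError(f"old string matches {count} times; add more context")
    return text.replace(old, new, 1)


source = "def area(r):\n    return 3.14 * r * r\n"
patched = str_replace(source, "3.14", "math.pi")
print(patched)
```

Because only the matched span changes, the rest of the file is preserved byte for byte, which is the property that avoids the errors of regenerating a whole file.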

Experiment

The Focus architecture was evaluated on five context-intensive SWE-bench Lite instances using an A/B comparison to determine if aggressive context compression could reduce token usage without sacrificing task accuracy. The experiments demonstrate that directive prompting, which enforces frequent and structured compression phases, enables significant token savings while maintaining the same success rate as a baseline agent. While the architecture is highly effective for exploration-heavy tasks, the results also indicate that compression overhead can occasionally exceed benefits in tasks requiring continuous iterative refinement.

In the A/B comparison, the Focus agent reduced total token consumption by 22.7% (14.9M → 11.5M tokens), while task success stayed identical: both agents solved 3 of the 5 instances (60%). Focus performed an average of 6.0 autonomous compressions per task, dropping the intermediate messages each time, and savings on individual instances reached up to 57%.
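As a quick check of the headline figure, the reduction can be recomputed from the rounded totals reported in the abstract (14.9M and 11.5M tokens). The helper name below is hypothetical.

```python
def percent_reduction(baseline: float, reduced: float) -> float:
    """Relative saving of `reduced` versus `baseline`, in percent."""
    return (baseline - reduced) / baseline * 100.0


baseline_tokens_m = 14.9  # baseline agent, millions of tokens (rounded, from the abstract)
focus_tokens_m = 11.5     # Focus agent, millions of tokens (rounded, from the abstract)
reduction = percent_reduction(baseline_tokens_m, focus_tokens_m)
print(f"{reduction:.1f}%")
```

The rounded totals give roughly 22.8%, in line with the reported 22.7%. The small gap is presumably because the paper's figure was computed from unrounded token counts.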
