Command Palette
Search for a command to run...
공자 코드 에이전트: 실세계 코드베이스를 위한 확장 가능한 에이전트 스케일링
공자 코드 에이전트: 실세계 코드베이스를 위한 확장 가능한 에이전트 스케일링
초록
실제 소프트웨어 공학 작업은 대규모 코드 저장소에서 작동할 수 있으며, 장기적인 작업 세션을 유지하고 테스트 시 복잡한 도구 체인을 신뢰성 있게 조율할 수 있는 코드 에이전트를 필요로 한다. 기존의 연구용 코드 에이전트는 투명성은 제공하지만, 실용 수준의 무거운 워크로드로 확장할 경우 어려움을 겪는 반면, 실용용 시스템은 뛰어난 실용적 성능을 달성하지만 확장성, 해석 가능성, 제어 가능성 측면에서 제한이 있다. 본 연구에서는 대규모 코드베이스에서 작동할 수 있는 소프트웨어 공학 에이전트인 'Confucius Code Agent(이하 CCA)'를 제안한다. CCA는 세 가지 상호보완적인 관점 — 에이전트 경험(AX, Agent Experience), 사용자 경험(UX, User Experience), 개발자 경험(DX, Developer Experience) — 을 기반으로 설계된 Confucius SDK 위에 구축된다. 이 SDK는 장기적 맥락 추론을 위한 계층적 작업 메모리와 통합된 오케스트레이터, 세션 간 지속적 학습을 위한 지속적 노트 기록 시스템, 신뢰성 있는 도구 사용을 위한 모듈식 확장 시스템을 통합하고 있다. 또한, 빌드-테스트-개선 루프를 통해 에이전트 구성의 자동 합성, 평가 및 정교화를 수행하는 메타에이전트를 도입하여, 새로운 작업, 환경, 도구 스택에 빠르게 적응할 수 있도록 했다. 이러한 메커니즘을 기반으로 구현된 CCA는 실제 소프트웨어 공학 작업에서 뛰어난 성능을 보였다. SWE-Bench-Pro 벤치마크에서 CCA는 동일한 저장소, 모델 백엔드, 도구 접근 조건 하에 Resolve@1 기준 54.3%의 성과를 기록하며, 이는 이전의 연구 기준을 초과하고 상용 시스템의 결과와도 경쟁할 수 있는 수준이다.
One-sentence Summary
The authors from Meta and Harvard propose the Confucius Code Agent (CCA), a scalable software engineering agent built on the Confucius SDK, which integrates hierarchical working memory, adaptive context compression, persistent note-taking, and a meta-agent for automated configuration refinement; by explicitly separating Agent, User, and Developer Experience, CCA achieves state-of-the-art performance on SWE-Bench-Pro (54.3% Resolve@1) and demonstrates superior robustness in real-world debugging tasks, outperforming both research and commercial systems under identical conditions.
Key Contributions
-
Real-world software engineering demands coding agents capable of long-horizon, multi-file tasks in large codebases, yet existing systems either lack scalability (research-grade agents) or suffer from poor interpretability and extensibility (production-grade systems), creating a critical gap in agentic software engineering.
-
The Confucius Code Agent (CCA) addresses this through the Confucius SDK, which systematically separates Agent Experience (AX), User Experience (UX), and Developer Experience (DX), enabling scalable reasoning via hierarchical working memory, persistent note-taking for cross-session learning, and modular tool extensions with full observability.
-
On SWE-Bench-Pro, CCA achieves a Resolve@1 of 54.3%, outperforming prior research baselines and matching commercial systems under identical conditions, with ablation studies confirming the critical role of its meta-agent-driven configuration refinement and hierarchical memory mechanisms.
Introduction
The authors address the growing need for coding agents capable of handling large-scale, real-world software engineering tasks—such as navigating massive codebases, maintaining long-horizon workflows, and orchestrating complex toolchains—where existing systems fall short. Prior work faces a fundamental trade-off: research-grade agents offer transparency and interpretability but fail to scale to production workloads, while production systems deliver strong performance at the cost of limited extensibility, controllability, and reproducibility. The key challenges lie in sustaining long-context reasoning across dispersed code modules and enabling persistent, cross-session learning to avoid redundant effort. To overcome these, the authors introduce the Confucius Code Agent (CCA), built on the Confucius SDK, a platform structured around three distinct design axes: Agent Experience (AX), User Experience (UX), and Developer Experience (DX). The SDK enables scalable agent behavior through a hierarchical working memory for efficient long-context reasoning, a persistent note-taking system that captures reusable insights and failure patterns, modular extensions for reliable tool integration, and a meta-agent that automates configuration synthesis and refinement via a build-test-improve loop. This principled scaffolding allows CCA to achieve a Resolve@1 of 54.3% on SWE-Bench-Pro—surpassing prior research baselines and matching commercial systems—demonstrating that effective agent design, not just model size, is critical for real-world impact.
Dataset
- The dataset is composed of real-world software engineering tasks drawn from GitHub issues and code repositories, specifically from the PyTorch and Open Library projects.
- For PyTorch-Bench, the authors collected 8 reproducible issues from the PyTorch GitHub repository between January and July 2025, selecting only those with clear reproduction scripts, detailed descriptions, and replication instructions.
- For the Open Library subset, the dataset includes project-specific and shared knowledge notes generated by an agent, organized into a hierarchical structure with directories for projects, shared utilities, and individual markdown files documenting edge cases and fixes.
- Each subset is processed to extract actionable insights: issues are filtered for reproducibility and technical clarity, while notes are structured with metadata including ID, title, description, and keywords for easy retrieval.
- The authors use the data in a training mixture where both PyTorch-Bench and Open Library tasks are combined, with balanced ratios to ensure diverse exposure to different types of software debugging challenges.
- A cropping strategy is applied by focusing only on relevant code files and test cases—such as the find_author() function and its associated test file—ensuring the agent operates on minimal, contextually sufficient input.
- Metadata is constructed for each note using standardized fields (id, title, description, keywords) to support semantic search and retrieval during agent execution.
Method
The Confucius SDK is designed around a three-axis framework that prioritizes Agent Experience (AX), User Experience (UX), and Developer Experience (DX) as interdependent and first-class design concerns. This holistic approach is reflected in the system's architecture, which separates the internal cognitive workspace of the agent from the external interfaces for users and developers. The core of the system is the Confucius Orchestrator, a minimal execution loop that manages the interaction between the large language model (LLM), the agent's memory, and external tools. The orchestrator operates in a bounded loop, invoking the LLM with a system prompt and the current memory state, parsing the model's output into structured actions, and routing these actions to the appropriate extensions for execution. This process continues until the agent signals completion or the maximum iteration limit is reached. Extensions, which are modular components that attach to the orchestrator, handle specific tasks such as parsing model outputs, executing tools, and shaping prompts. This separation allows for a clean division of responsibilities and enables the reuse of behaviors across different agents.
The framework's design philosophy is illustrated in the Confucius Core diagram, which shows the distinct channels for AX, UX, and DX. The agent's internal cognitive workspace (AX) is optimized for efficiency and minimal noise, receiving a distilled and structured version of the conversation history. In contrast, the user experience (UX) is designed for transparency and interpretability, providing rich, instrumented traces that allow users to observe and interact with the agent's behavior. The developer experience (DX) is supported by observability tools that provide insight into both the agent's reasoning (AX) and its external behavior (UX), enabling reproducibility, debugging, and rapid iteration. This separation of concerns ensures that each axis can be optimized independently without compromising the others.
A key challenge in long-running agent tasks is managing the context window of the LLM, which can be overwhelmed by the accumulation of conversation history. The Confucius SDK addresses this with an adaptive context compression mechanism. When the effective prompt length approaches configurable thresholds, a dedicated planner agent, the Code Architect, is invoked to analyze the conversation history and construct a structured summary. This summary, which explicitly preserves key information categories such as task goals, decisions made, open TODOs, and critical error traces, replaces the original large spans of history. The system maintains a rolling window of recent interactions in their original form, ensuring that the agent can sustain multi-step reasoning over long trajectories without exceeding context limits. This approach provides a more robust and semantically aware alternative to fixed-window truncation or simple retrieval methods.
The Confucius Code Agent (CCA) is a specific instantiation of the Confucius SDK, built using a meta-agent that automates the agent development process. The meta-agent operates within a build-test-improve loop, where it synthesizes agent configurations, wires together orchestrator components and extensions, evaluates candidate agents on representative tasks, and iteratively refines prompts and tool-use policies based on observed failures. This automated process allows for the rapid development of agents that are tailored to specific use cases and environments. The CCA is composed of the orchestrator, memory, and extensions, with the memory system including a hierarchical note-taking agent that turns interaction traces into structured persistent knowledge. This knowledge is stored as Markdown files in a file-system-like tree, enabling agents to retrieve and reuse information across sessions, including failure cases and known fixes. The extension system, which includes components like the Bash tool and file edit tool, provides a modular and extensible way to integrate new capabilities into the agent. The overall architecture of the CCA, as shown in the diagram, demonstrates how the Confucius SDK's components work together to create a robust and adaptable coding agent.
Experiment
- On SWE-Bench-Pro, CCA achieves 52.7% Resolve@1 with Claude 4.5 Sonnet and 54.3% with Claude 4.5 Opus, surpassing Live-SWE-Agent (45.8%) and Anthropic’s proprietary scaffold (52.0%), demonstrating that improved agentic scaffolding alone drives performance gains.
- Ablation studies show that learned tool-use via the Meta-agent contributes significantly to performance, with Resolve@1 dropping substantially when disabled, even with advanced context management intact.
- Hierarchical context management improves Resolve@1 by +6.6 points (42.0 to 48.6) on Claude 4 Sonnet, enabling deeper reasoning and reducing prompt length by over 40% without losing key information.
- CCA maintains stable performance across multi-file edit scenarios, with only moderate degradation on tasks modifying more files, indicating robustness in complex refactoring.
- Long-term memory via note-taking reduces token cost by 11k and iteration turns by 3, while improving Resolve@1 from 53% to 54.4% on repeated SWE-Bench-Pro tasks, showing effective cross-session learning.
- On SWE-Bench-Verified, CCA achieves 74.6% Resolve Rate with Claude 4 Sonnet, outperforming OpenHands and a mini-SWE-Agent variant using Claude 4.5 Sonnet, underscoring the impact of scaffolding over backbone model strength.
- Thinking budget scaling shows diminishing returns beyond 16k tokens on SWE-Bench-Verified, with no precise control over internal reasoning length due to model-level abstraction.
- Case studies on real PyTorch issues reveal CCA’s minimal, principled fixes align with eventual PyTorch team solutions, while CC’s multi-agent approach leads to over-engineered, context-fragmented solutions despite higher complexity.
The authors use CCA, a novel agent scaffold, to achieve higher performance on SWE-Bench-Pro compared to existing baselines and proprietary systems. Results show that CCA consistently surpasses SWE-Agent and Live-SWE-Agent across different backbone models, reaching a Resolve Rate of 54.3% with Claude 4.5 Opus, outperforming the results reported by Anthropic's proprietary system.

The authors use Table 5 to compare CCA against open-source scaffolds on the SWE-Bench-Verified benchmark, showing that CCA achieves a Resolve Rate of 74.6% with Claude 4 Sonnet, outperforming the strongest open-source system, OpenHands, under identical backbone conditions. This result demonstrates that enhanced agentic scaffolding can close or surpass the performance gap introduced by differences in backbone model capability.

The authors use the SWE-Bench-Pro benchmark to compare CCA against various agent scaffolds and backbone models, showing that CCA consistently outperforms the SWE-Agent baseline across different models. With Claude 4.5 Opus, CCA achieves a Resolve@1 rate of 54.3%, surpassing both the best research-grade coding agent and the proprietary system from Anthropic, demonstrating that improvements stem from enhanced scaffolding rather than model or evaluation differences.

The authors analyze CCA's performance on SWE-Bench-Pro as a function of the number of files modified, grouping tasks into edited-file buckets. Results show that CCA maintains stable performance across different edit volumes, with the highest resolve rate of 57.8% for tasks modifying 1–2 files and a moderate decline to 44.4% for tasks modifying 10 or more files.

Results show that CCA's performance improves significantly with advanced context management and tool use. For Claude 4.5 Sonnet, enabling both advanced context management and advanced tool use increases the Resolve Rate from 44.0% to 51.6%, demonstrating the combined impact of these features.
