From Skills to Talent: Organising Heterogeneous Agents as a Real-World Company
Zhengxu Yu Yu Fu Zhiyuan He Yuxuan Huang Lee Ka Yiu Meng Fang Weilin Luo Jun Wang
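Abstract
Individual agent capabilities have advanced rapidly through modular skills and tool integrations, yet multi-agent systems remain constrained by fixed team structures, tightly coupled coordination logic, and session-bound learning. We argue that this reflects a deeper absence: a principled organizational layer that governs how a workforce of agents is assembled, governed, and improved over time, independently of what individual agents know. To close this gap, we introduce OneManCompany (OMC), a framework that elevates multi-agent systems to the organizational level. OMC encapsulates skills, tools, and runtime configurations into portable agent identities called Talents, which are orchestrated through typed organizational interfaces that abstract over heterogeneous backends. A community-driven Talent Market enables on-demand recruitment, allowing the organization to fill capability gaps and reconfigure itself dynamically during execution. Organizational decision-making is operationalized through an Explore-Execute-Review (E²R) tree search, which unifies planning, execution, and evaluation in a single hierarchical loop. In this loop, tasks are decomposed top-down into accountable units, and execution outcomes are aggregated bottom-up to drive systematic review and refinement. The loop provides formal guarantees on termination and deadlock freedom while mirroring the feedback mechanisms of human enterprises. Together, these contributions transform multi-agent systems from static, pre-configured pipelines into self-organizing and self-improving AI organizations that can adapt to open-ended tasks across diverse domains. Empirical evaluation on PRDBench shows that OMC achieves an 84.67% success rate, surpassing the state of the art by 15.48 percentage points, and cross-domain case studies further demonstrate its generality.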
One-sentence Summary
The authors introduce OneManCompany (OMC), a multi-agent framework that decouples organizational governance from individual capabilities through portable Talents and a dynamic Talent Market, while leveraging an Explore-Execute-Review (E²R) tree search to unify hierarchical planning and evaluation, ultimately achieving an 84.67% success rate on PRDBench that surpasses the state of the art by 15.48 percentage points.
Key Contributions
- The framework introduces a Talent-Container architecture that decouples portable agent identities from heterogeneous execution backends via six typed organisational interfaces. This design enables dynamic workforce assembly through a community-driven Talent Market that provisions verified agents on demand.
- Project execution is operationalized through an Explore-Execute-Review tree search that unifies hierarchical task decomposition, agent coordination, and outcome evaluation. A DAG-based task structure with AND-tree semantics and a finite state machine provides formal guarantees on termination and deadlock freedom while iteratively refining organisational strategies.
- Organisational improvement is automated through a structured feedback pipeline that updates agent working principles and standard operating procedures based on performance reviews. Quantitative evaluation on the PRDBench benchmark demonstrates an 84.67% success rate, surpassing state-of-the-art baselines by 15.48 percentage points.
Introduction
In software development and complex automation, scaling AI collaboration is critical, yet current multi-agent systems struggle with brittle team structures, runtime incompatibilities, and ad hoc coordination. These systems lack a unifying organizational layer that separates workforce structure from individual capabilities, preventing reliable generalization to open-ended projects. To address this gap, the authors introduce OneManCompany, an open-source framework that formalizes AI organization design through a decoupled talent and container architecture, a dynamic tree search for structured task decomposition, and continuous feedback loops for agent and organizational self-evolution. This approach enables heterogeneous agents to be automatically recruited, coordinated, and improved over time, mirroring the operational principles of human companies to tackle complex, cross-domain workflows.
Dataset
- Dataset Composition and Sources: The authors leverage PRDBench, a benchmark consisting of 50 project-level tasks drawn from over 20 distinct software development domains. Each task originates from a structured Product Requirement Document supplemented by auxiliary data, comprehensive test plans, and executable evaluation scripts.
- Subset Details: The collection operates as a unified set of 50 tasks without formal subgroups. Every task is engineered to replicate dynamic, in-the-wild agentic workflows, meaning team structures, runtime environments, task breakdowns, and execution sequences are intentionally concealed until the agent begins processing.
- Data Usage and Processing: The authors deploy this dataset exclusively for evaluating the OMC framework. They do not partition the data for training or apply mixture ratios, instead utilizing the complete benchmark to measure end-to-end capabilities in requirement interpretation, hierarchical decomposition, and multi-agent coordination.
- Workflow and Evaluation Mechanics: The authors rely on the embedded executable scripts and predefined evaluation criteria to automate performance measurement. No cropping strategies or metadata construction pipelines are described, as the benchmark prioritizes dynamic execution traces and script-driven validation to capture realistic development constraints.
Method
The OneManCompany (OMC) framework is designed to model multi-agent systems as self-organizing and self-improving organizations, structured around three core pillars: organizational management, project execution, and organizational evolution. At the foundation of this architecture is the concept of the Employee, which is composed of a portable Talent and a Container. The Talent encapsulates an agent's cognitive identity, including its role, skills, tools, and guiding principles, while the Container provides the runtime environment and the formal interfaces through which the agent interacts with the organizational layer. This Talent-Container architecture enables a decoupling of agent capabilities from their execution backends, allowing for heterogeneous agents—such as LangGraph-based, Claude Code, or script-driven executors—to coexist and be managed uniformly within a single organization. The Container hosts the agent runtime and provides six typed organizational interfaces: Execution, Task, Event, Storage, Context, and Lifecycle, which standardize agent-platform interaction and ensure policy enforcement, isolation, and extensibility. The organizational layer acts as a unifying abstraction, analogous to an operating system kernel, providing a consistent interface over diverse hardware and agent backends, as illustrated in the framework diagram.
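The Talent-Container split described above can be illustrated with a minimal sketch. All class and method names here (`Talent`, `Container`, `ScriptContainer`, `LLMContainer`, `Employee`, `execute`, `run`) are hypothetical and are not taken from the OMC codebase; only the six interface names come from the text, and only the Execution interface is sketched.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass

# The six typed interfaces named in the text; everything else is an assumption.
INTERFACES = ("Execution", "Task", "Event", "Storage", "Context", "Lifecycle")


@dataclass(frozen=True)
class Talent:
    """Portable agent identity: holds no reference to any runtime."""
    role: str
    skills: tuple[str, ...] = ()
    tools: tuple[str, ...] = ()
    principles: tuple[str, ...] = ()


class Container(ABC):
    """Runtime side of an employee (only the Execution interface is shown)."""

    @abstractmethod
    def execute(self, talent: Talent, task: str) -> str: ...


class ScriptContainer(Container):
    def execute(self, talent: Talent, task: str) -> str:
        return f"[script] {talent.role} ran: {task}"


class LLMContainer(Container):
    def execute(self, talent: Talent, task: str) -> str:
        # A real backend would call a model; this stand-in only formats a prompt.
        return f"[llm] You are a {talent.role}. Task: {task}"


@dataclass
class Employee:
    """Employee = Talent + Container, as in the text."""
    talent: Talent
    container: Container

    def run(self, task: str) -> str:
        return self.container.execute(self.talent, task)


# The same Talent object is reused unchanged across two different backends.
qa = Talent(role="QA engineer", skills=("pytest",))
on_script = Employee(qa, ScriptContainer())
on_llm = Employee(qa, LLMContainer())
```

Because the `Talent` is an immutable value with no backend reference, swapping the `Container` changes how a task is run without touching the agent's identity, which is the decoupling the paragraph describes.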
The Talent Market serves as a community-driven agent marketplace, enabling on-demand recruitment of verified, benchmark-validated agent packages. These Talents are complete, ready-to-deploy agent packages that include system prompts, role definitions, tool configurations, skill scripts, and domain knowledge, decoupled from any specific Container. The market supports three sourcing channels: community-contributed Talents from open-source repositories, AI-recommended assembly of skills from the web to address cold-start domains, and internal promotion of high-performing employees. When a project requires a capability not present in the current workforce, the HR agent queries the Talent Market, compiles a ranked shortlist based on skill match and community ratings, and presents candidates to the CEO for approval. Upon selection, the system automates the provisioning of a Container, assigns a desk, configures tool access, and registers the new employee, enabling dynamic team assembly without manual setup.
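As a rough illustration of the shortlist step, the sketch below ranks market listings by a weighted mix of skill match and community rating. The source does not specify how the two signals are combined; the `Listing` type, the 0 to 5 rating scale, and the 0.7/0.3 weights are assumptions made only for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Listing:
    name: str
    skills: frozenset
    rating: float  # community rating on an assumed 0-5 scale


def shortlist(required, listings, k=3, w_skill=0.7, w_rating=0.3):
    """Return the names of the top-k listings (weights are illustrative)."""
    required = frozenset(required)

    def score(listing):
        # Fraction of the required skills that the listing covers.
        match = len(required & listing.skills) / len(required) if required else 0.0
        return w_skill * match + w_rating * (listing.rating / 5.0)

    ranked = sorted(listings, key=score, reverse=True)
    return [listing.name for listing in ranked[:k]]


market = [
    Listing("pixel-artist", frozenset({"sprites", "animation"}), 4.8),
    Listing("game-dev", frozenset({"godot", "sprites"}), 4.0),
    Listing("backend-dev", frozenset({"python", "sql"}), 4.9),
]

# Candidates that would be presented to the CEO for approval.
candidates = shortlist({"godot", "sprites"}, market, k=2)
```

Here a full skill match outweighs a higher rating, so `game-dev` is ranked ahead of `pixel-artist`; a different weighting would change the order.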
Project execution is governed by the Explore-Execute-Review (E²R) tree search, a hierarchical loop that decomposes tasks top-down into accountable units and aggregates outcomes bottom-up to drive refinement. The E²R search operates over a dynamic search tree $T = (V, E_{\mathrm{tree}}, E_{\mathrm{dep}})$, where nodes represent organizational states at decision points, carrying attributes such as task description, assigned employee, status, result, and cost. The tree grows through five action types: decompose (adding child tasks), assign (binding an employee to a leaf), recruit (hiring a new employee), review (accepting or rejecting a result), and iterate (creating a new root-level strategy). The policy $\pi(T)$ selects a strategy for the current decision point, determining how to decompose the task and whom to assign. The exploration stage selects a strategy, the execution stage carries out the plan, and the review stage evaluates the result, producing a quality signal that propagates bottom-up. This accept-or-redecompose cycle continues until the root is resolved or a circuit breaker fires.
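The accept-or-redecompose cycle can be sketched as a small recursive loop. This is a simplified illustration, not the authors' algorithm: it covers only the decompose, assign, and review actions, uses a review limit as the sole circuit breaker, and all names (`Node`, `e2r`, `policy`, `execute`, `review`) are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    task: str
    children: list = field(default_factory=list)
    status: str = "PENDING"
    result: str = ""
    reviews: int = 0


def e2r(node, policy, execute, review, max_reviews=3):
    """Resolve `node`; True if accepted, False if the review limit was hit."""
    while node.reviews < max_reviews:
        subtasks = policy(node)                     # Explore: choose a strategy
        if subtasks:                                # decompose into children
            node.children = [Node(t) for t in subtasks]
            ok = all([e2r(c, policy, execute, review, max_reviews)
                      for c in node.children])      # AND over all children
            node.result = " + ".join(c.result for c in node.children)
        else:                                       # assign and Execute a leaf
            node.result = execute(node)
            ok = True
        node.reviews += 1
        if ok and review(node):                     # Review: quality signal
            node.status = "ACCEPTED"
            return True
    node.status = "FAILED"                          # circuit breaker fired
    return False


# Toy policy: split the root once, treat everything else as a leaf.
policy = lambda n: ["write code", "write tests"] if n.task == "build app" else []
execute = lambda n: f"{n.task}#v{n.reviews + 1}"
review = lambda n: n.result != "write tests#v1"    # reject the first test draft

root = Node("build app")
accepted = e2r(root, policy, execute, review)
```

In this run the reviewer rejects the first attempt at "write tests", the leaf is executed again, and the root is accepted only after both children are accepted, which mirrors the bottom-up propagation described above.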
The E²R tree search is complemented by a DAG-based execution layer that ensures reliable task completion. The task tree is augmented with dependency edges $E_{\mathrm{dep}}$, and the resulting graph must remain acyclic, a property enforced at insertion time. A node v becomes executable when its dependency constraints are satisfied, meaning all its predecessors are in the ACCEPTED or FINISHED state. The scheduler selects ready nodes in FIFO order, subject to a mutual exclusion invariant that ensures no employee runs more than one task at a time. The task lifecycle is governed by a finite state machine with states including PENDING, PROCESSING, COMPLETED, ACCEPTED, FAILED, and FINISHED. A key structural guarantee is the AND-semantics: a node is resolved only when all its children are resolved, ensuring that project completion is a derived property of subtask completion. This bottom-up propagation prevents stalls from going unnoticed and ensures that no task can be silently dropped. The system also implements bounded rationality through circuit breakers, including a review limit, a task timeout, and a cost budget, which guarantee that every search episode terminates in bounded time and cost.
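A minimal sketch of the execution layer's three invariants (acyclicity checked at insertion, readiness once all predecessors are ACCEPTED or FINISHED, and one task per employee) might look as follows. The `TaskDAG` class and its methods are hypothetical; the state names come from the text, but the allowed transitions between them are not specified in the source and are not modeled here.

```python
DONE = {"ACCEPTED", "FINISHED"}  # states that satisfy a dependency (from the text)


class TaskDAG:
    def __init__(self):
        self.deps = {}    # task -> set of prerequisite tasks
        self.owner = {}   # task -> assigned employee
        self.status = {}  # task -> state name

    def add_task(self, task, employee):
        self.deps[task] = set()
        self.owner[task] = employee
        self.status[task] = "PENDING"

    def _depends_on(self, task, target):
        """True if `task` transitively depends on `target` (or is `target`)."""
        stack, seen = [task], set()
        while stack:
            t = stack.pop()
            if t == target:
                return True
            if t not in seen:
                seen.add(t)
                stack.extend(self.deps[t])
        return False

    def add_dep(self, task, prerequisite):
        # Reject the edge at insertion time if it would close a cycle.
        if self._depends_on(prerequisite, task):
            raise ValueError(f"{task} -> {prerequisite} would create a cycle")
        self.deps[task].add(prerequisite)

    def ready(self):
        """Ready tasks in insertion (FIFO) order, at most one per employee."""
        busy = {self.owner[t] for t, s in self.status.items() if s == "PROCESSING"}
        picked = []
        for t in self.deps:  # dicts preserve insertion order
            if self.status[t] != "PENDING" or self.owner[t] in busy:
                continue
            if all(self.status[d] in DONE for d in self.deps[t]):
                busy.add(self.owner[t])  # mutual exclusion within this round
                picked.append(t)
        return picked


dag = TaskDAG()
for name, who in [("design", "alice"), ("api", "bob"), ("ui", "bob"), ("test", "carol")]:
    dag.add_task(name, who)
dag.add_dep("api", "design")
dag.add_dep("ui", "design")
dag.add_dep("test", "api")
dag.add_dep("test", "ui")

round1 = dag.ready()
dag.status["design"] = "ACCEPTED"
round2 = dag.ready()  # "ui" waits because bob already holds "api"

try:
    dag.add_dep("design", "test")
    cycle_rejected = False
except ValueError:
    cycle_rejected = True
```

Checking reachability before adding each edge keeps the graph acyclic at all times, so the scheduler never has to detect cycles while running.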
Organizational evolution occurs through a combination of individual and organizational learning mechanisms. Individually, agents maintain a persistent profile that includes a progress log and summarized working principles. After each CEO one-on-one, the agent performs structured self-reflection to update its principles, and upon task completion, it conducts a post-task review to update its log. These updates modify the agent's Talent artefacts, enabling continuous improvement without retraining. At the organizational level, project retrospectives are conducted, in which employees submit self-assessments and the COO aggregates findings into individual feedback and organizational Standard Operating Procedures (SOPs) that are injected into future agent contexts. A formal performance review pipeline ensures accountability: every three projects, the HR agent initiates a review, and employees failing three consecutive reviews enter a Performance Improvement Plan, with offboarding triggered upon a fourth failure. This lifecycle management closes the loop between the Talent Market and organizational evolution, ensuring that underperforming agents are replaced and high-performing agents are continuously refined.
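The review cadence in the paragraph above can be written as a small state tracker. The thresholds (a review every three projects, a Performance Improvement Plan after three consecutive failed reviews, offboarding on a fourth) come from the text. The names `Record`, `review_due`, and `apply_review`, and the assumption that a passed review resets the streak and ends the PIP, are illustrative guesses; the source does not state those details.

```python
from dataclasses import dataclass

REVIEW_EVERY = 3  # projects between reviews (from the text)
PIP_AFTER = 3     # consecutive failed reviews that trigger a PIP (from the text)


@dataclass
class Record:
    projects: int = 0
    failed_streak: int = 0
    state: str = "ACTIVE"  # ACTIVE -> PIP -> OFFBOARDED


def review_due(rec):
    """Count one finished project and report whether a review is due."""
    rec.projects += 1
    return rec.projects % REVIEW_EVERY == 0


def apply_review(rec, passed):
    """Update the record after one review and return the new state."""
    if rec.state == "OFFBOARDED":
        return rec.state
    if passed:
        # Assumption: a passed review clears the streak and ends any PIP.
        rec.failed_streak = 0
        rec.state = "ACTIVE"
    else:
        rec.failed_streak += 1
        if rec.failed_streak > PIP_AFTER:
            rec.state = "OFFBOARDED"
        elif rec.failed_streak == PIP_AFTER:
            rec.state = "PIP"
    return rec.state


# An employee who fails every review over twelve projects.
rec = Record()
states = []
for _ in range(12):
    if review_due(rec):
        states.append(apply_review(rec, passed=False))
```

Twelve projects produce four reviews, so an employee who fails all of them moves through ACTIVE, ACTIVE, PIP, and finally OFFBOARDED.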
Experiment
The evaluation pairs a standardized software development benchmark with four cross-domain case studies to validate the system’s capacity for autonomous, project-level orchestration. Testing demonstrates that dynamic task decomposition, enforced quality gates, and seamless coordination across heterogeneous model families enable reliable execution without requiring domain-specific configuration. Qualitative analysis across software engineering, game development, multimedia production, and academic research highlights the framework’s effectiveness in supporting iterative human feedback and cross-modal collaboration. Ultimately, the findings indicate that organizational architectures can successfully scale AI agent teamwork to handle diverse, complex workflows while maintaining adaptability and output correctness.
The authors present a multi-agent system that recruits specialized agents from diverse model families to execute complex tasks across software development, game development, audio-visual production, and academic research. Dynamic task decomposition, combined with a review gate that stops errors from propagating, underpins its high success rates, and recruiting across model families enables cross-modal and cross-domain coordination. Multi-agent coordination adds a cost overhead, which the authors consider justified for complex, project-level tasks that require high accuracy.
The authors compare OMC with existing systems along several dimensions, including design paradigm, execution model, and organizational evolution. According to the comparison table, OMC follows an on-demand organization paradigm with typed interfaces, whereas the baselines use sequential or distributed models. OMC also supports multi-family agent coordination and dynamic organizational evolution, both absent from the other systems, which lack self-evolution capabilities and rely on fixed agent sources rather than recruiting talent dynamically from a market.
The authors compare agent systems on the software development benchmark (PRDBench) in terms of success rate and cost. OMC achieves the highest success rate of all methods, 84.67%, which is 15.48 percentage points above the best-performing baseline. It also incurs a higher cost than the other systems, reflecting the overhead of coordinated execution across specialized agents.
The evaluation encompasses complex cross-domain projects in software, game, and media development alongside comparative benchmarks against established agent frameworks. These experiments validate the system's ability to dynamically decompose tasks, coordinate specialized models across different families, and enforce quality control through structured review gates. Comparative analysis confirms that the proposed architecture surpasses fixed-baseline systems by supporting flexible, on-demand organizational evolution and seamless cross-modal coordination. Ultimately, the findings demonstrate that although multi-agent coordination increases computational overhead, the significant improvements in adaptability and success rates justify the trade-off for complex, project-level applications.