Loop Engineering: Anthropic's Playbook for Building Systems That Prompt Agents
Peter Steinberger Boris Cherny Addy Osmani
Abstract
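Over the past two years, a succession of "X engineering" terms has appeared, tracking the pace of model releases. This note examines the latest of them, Loop Engineering. The term was put forward independently in June 2026 by Peter Steinberger, Boris Cherny, and Addy Osmani, and was named in writing by Osmani. Whereas prompt, context, and harness engineering teach practitioners techniques for doing the work better, loop engineering lets practitioners step out of the position of doing the work at all. We define the term and place it as a fourth layer above the harness. We then decompose a single turn of the loop (one round trip) into five moves (discovery, handoff, verification, persistence, and scheduling) and describe in detail the six parts that implement them. We pay particular attention to the separation of generator and evaluator. Empirically, agents asked to grade their own output reportedly tend to praise it. Tuning an independent, skeptical evaluator is therefore far more tractable than making a generator critical of itself. We survey three loops in real use, ranging from one engineer's morning triage to an enterprise-scale pipeline in which Stripe merges more than 1,300 machine-generated pull requests per week. We also catalog four costs that accumulate quietly: verification debt, comprehension rot, cognitive surrender, and token blowout. Finally, we give a concrete recipe for building a first loop. The central claim of this note is that loops make generation nearly free, leaving judgment as the only scarce resource. The same loop, built by two different people, can produce exactly opposite results.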
One-sentence Summary
In this note, Peter Steinberger, Boris Cherny, and Addy Osmani introduce Loop Engineering as a fourth layer above harness engineering that removes practitioners from performing work by designing self-prompting agent loops, decomposing each turn into discovery, handoff, verification, persistence, and scheduling, crucially separating generator from evaluator because agents grading their own output tend to self-praise, and surveying real-world loops from a personal morning triage to Stripe’s pipeline merging over 1,300 machine-written pull requests per week, demonstrating that loops make generation nearly free while judgment becomes the scarce resource and the same loop can produce opposite outcomes in different hands.
Key Contributions
- The note defines loop engineering as a fourth layer above harness engineering, decomposing a single loop turn into five moves (discovery, handoff, verification, persistence, scheduling) and six constituent parts.
- It argues for a generator/evaluator separation, reporting that agents tend to overpraise their own outputs and that an independently tuned skeptical evaluator is far more tractable than making a generator self-critical.
- The note surveys three real-world loops, catalogs four hidden costs (verification debt, comprehension rot, cognitive surrender, token blowout), provides a concrete build recipe, and establishes that loops make generation nearly free, concentrating engineering value into judgment as the scarce resource.
Introduction
The authors examine a new paradigm called Loop Engineering, which shifts the practitioner from directly prompting AI coding agents to designing autonomous systems that prompt themselves. This matters because the earlier approaches (prompt, context, and harness engineering) all kept a human in the loop, which limits scalability and demands constant attention. The key limitation of prior work is that the human must act as both the clock and the decision-maker and so cannot step away. The authors' main contribution is a formal definition of loop engineering, a decomposition of a loop's turn into five moves (discovery, handoff, verification, persistence, and scheduling), and an emphasis on the generator/evaluator split as the way to preserve judgment while automating generation.
Method
The authors propose a hierarchical framework for engineering AI agents, culminating in a self-running loop architecture. This framework stacks four distinct layers, each expanding the scope of concern. As shown in the figure below, the stack progresses from Prompt Engineering at the base, through Context and Harness Engineering, to Loop Engineering at the top.
Prompt Engineering manages the wording for a single exchange. Context Engineering curates the model's field of view. Harness Engineering equips a single run with tools and actions. Loop Engineering automates the entire process, allowing the system to wake on a schedule, spawn sub-agents, and feed its own output back as input for subsequent rounds.
A functional loop executes a concrete cycle of five moves rather than spinning idly. As illustrated in the diagram below, these moves form a continuous turn that feeds the next iteration.
First, Discovery identifies work worth doing, such as reading CI failures, allowing the agent to find its own tasks. Second, Handoff moves the task to an isolated environment, like a git worktree, to prevent collisions during parallel execution. Third, Verification checks the result, serving as the critical mechanism to reject poor output. Fourth, Persistence saves state to disk so the loop survives context window clearing. Finally, Scheduling triggers the next turn automatically.
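The five moves can be sketched as a single function that runs one turn. This is a minimal illustration, not the authors' implementation: `LoopState`, `run_turn`, and the stub callables in the demo are hypothetical names introduced here, and a real loop would call an agent and a scheduler where the stubs stand.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class LoopState:
    """State that must outlive any single context window (hypothetical shape)."""
    completed: List[str] = field(default_factory=list)
    rejected: List[str] = field(default_factory=list)
    turns: int = 0


def run_turn(
    state: LoopState,
    discover: Callable[[LoopState], Optional[str]],
    handoff: Callable[[str], str],
    verify: Callable[[str, str], bool],
    persist: Callable[[LoopState], None],
) -> bool:
    """Run one turn; return True if another turn should be scheduled."""
    task = discover(state)            # 1. Discovery: find work worth doing
    if task is None:
        persist(state)
        return False                  # nothing left, so do not reschedule
    result = handoff(task)            # 2. Handoff: do the work in isolation
    if verify(task, result):          # 3. Verification: must be able to say no
        state.completed.append(task)
    else:
        state.rejected.append(task)
    state.turns += 1
    persist(state)                    # 4. Persistence: save state outside the context
    return True                       # 5. Scheduling: the caller triggers the next turn


# Demo with in-memory stubs (assumptions, standing in for real agents and storage).
queue = ["fix flaky test", "bump dependency"]
saved_turn_counts: List[int] = []
state = LoopState()


def discover_from_queue(s: LoopState) -> Optional[str]:
    remaining = [t for t in queue if t not in s.completed and t not in s.rejected]
    return remaining[0] if remaining else None


while run_turn(
    state,
    discover_from_queue,
    handoff=lambda task: f"patch for {task}",
    verify=lambda task, result: "flaky" in task,   # stub verdict for the demo
    persist=lambda s: saved_turn_counts.append(s.turns),
):
    pass
```

Returning a boolean instead of sleeping inside the function keeps scheduling outside the turn, so the same turn could be driven by a timer, an event, or a plain `while` loop as in the demo.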
To enable these moves, the architecture relies on six structural parts. Automations trigger the loop based on time or events. Worktrees provide isolation for parallel agents. Skills store permanent project knowledge to reduce intent debt. Connectors link the loop to external tools via protocols like MCP. Sub-agents split the writer from the judge. Memory ensures state persists across days outside the conversation window.
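Two of these parts, memory and worktrees, lend themselves to a small sketch. The code below assumes memory is a JSON file on disk and that each task gets its own worktree under a `.worktrees` directory. The file layout, the `loop/` branch prefix, and the function names are illustrative assumptions, not details from the source. The worktree helper only builds the `git worktree add -b <branch> <path>` argument list and does not execute it.

```python
import json
import os
import tempfile
from pathlib import Path
from typing import Any, Dict, List


def save_memory(path: Path, memory: Dict[str, Any]) -> None:
    """Write memory via a temp file and rename, so a crash cannot leave a half-written file."""
    path.parent.mkdir(parents=True, exist_ok=True)
    fd, tmp_name = tempfile.mkstemp(dir=path.parent, suffix=".tmp")
    with os.fdopen(fd, "w", encoding="utf-8") as handle:
        json.dump(memory, handle, indent=2, sort_keys=True)
    os.replace(tmp_name, path)  # replaces the target in one step on the same filesystem


def load_memory(path: Path) -> Dict[str, Any]:
    """Return saved memory, or an empty record on the first run."""
    if not path.exists():
        return {"turn": 0, "notes": []}
    return json.loads(path.read_text(encoding="utf-8"))


def worktree_command(repo_root: Path, task_id: str) -> List[str]:
    """Build (but do not run) a git command that gives one task its own worktree."""
    worktree_path = repo_root / ".worktrees" / task_id   # hypothetical layout
    return ["git", "worktree", "add", "-b", f"loop/{task_id}", str(worktree_path)]


# Demo: memory survives a "restart" because it lives on disk, not in the conversation.
with tempfile.TemporaryDirectory() as tmp:
    memory_file = Path(tmp) / "loop" / "memory.json"
    memory = load_memory(memory_file)
    memory["turn"] += 1
    memory["notes"].append("CI red on main")
    save_memory(memory_file, memory)
    restored = load_memory(memory_file)
```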
The most critical architectural decision involves the verification module. The authors note that agents tend to praise their own work, leading to a nodding loop where errors accumulate. To solve this, the framework leverages a Maker-Checker principle. As shown in the figure below, the architecture structurally splits the agent into a Generator and an Evaluator.
The Generator writes the code. The Evaluator, often a different model instructed to assume the code is broken, reviews it. Crucially, the Evaluator acts by running tests or inspecting the DOM rather than just reading code.
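A minimal sketch of the split follows, assuming the generator and the evaluator's checks are plain callables. `Verdict`, `evaluate`, `generate_until_accepted`, and the stub generator are hypothetical names for illustration. The evaluator starts from the assumption that the candidate is broken and accepts it only if every check, which actually runs the code, finds no problem.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class Verdict:
    passed: bool
    reasons: List[str]


def evaluate(candidate: str, checks: List[Callable[[str], str]]) -> Verdict:
    """Skeptical evaluator: each check returns "" on success or a failure reason."""
    reasons = [msg for msg in (check(candidate) for check in checks) if msg]
    return Verdict(passed=not reasons, reasons=reasons)


def generate_until_accepted(
    generate: Callable[[List[str]], str],
    checks: List[Callable[[str], str]],
    max_attempts: int = 3,
) -> Tuple[Optional[str], int]:
    """Alternate generator and evaluator; the generator never grades itself."""
    feedback: List[str] = []
    for attempt in range(1, max_attempts + 1):
        candidate = generate(feedback)
        verdict = evaluate(candidate, checks)
        if verdict.passed:
            return candidate, attempt
        feedback = verdict.reasons       # only the evaluator's findings flow back
    return None, max_attempts


def runs_and_adds(candidate: str) -> str:
    """Check by executing the candidate rather than reading it."""
    namespace: dict = {}
    try:
        exec(candidate, namespace)       # demo only: real loops need a sandbox
        if namespace["add"](2, 3) != 5:
            return "add(2, 3) did not return 5"
    except Exception as exc:
        return f"candidate raised {exc!r}"
    return ""


def stub_generator(feedback: List[str]) -> str:
    """Stand-in for a model: buggy first draft, corrected once it receives feedback."""
    if feedback:
        return "def add(a, b):\n    return a + b\n"
    return "def add(a, b):\n    return a - b\n"


accepted, attempts = generate_until_accepted(stub_generator, [runs_and_adds])
```

Keeping the verdict logic outside the generator means the checks can be tightened independently, without re-prompting the component that writes the code.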
The stop condition is managed by a fresh model checking if a specific goal is met. The code snippet below demonstrates this logic, where a small fast model checks the condition after each turn.
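A stop check of this kind might look like the following sketch, which is not the authors' snippet. It assumes the checker model is reachable through a hypothetical callable `ask_checker(prompt) -> str` and that each turn returns a short evidence string. Anything other than a clear YES counts as "not done", and a turn cap bounds the cost if the goal is never reached.

```python
from typing import Callable, Optional


def goal_met(ask_checker: Callable[[str], str], goal: str, evidence: str) -> bool:
    """Ask a fresh checker whether the goal is met; default to False on any other answer."""
    prompt = (
        "Answer YES or NO only.\n"
        f"Goal: {goal}\n"
        f"Evidence from the latest turn:\n{evidence}\n"
        "Is the goal met?"
    )
    return ask_checker(prompt).strip().upper().startswith("YES")


def run_until_goal(
    do_turn: Callable[[int], str],
    ask_checker: Callable[[str], str],
    goal: str,
    max_turns: int = 10,
) -> Optional[int]:
    """Run turns until the checker says the goal is met; return the turn number or None."""
    for turn in range(1, max_turns + 1):
        evidence = do_turn(turn)
        if goal_met(ask_checker, goal, evidence):
            return turn
    return None   # cap reached: stop rather than spin


# Demo stubs (assumptions): each turn fixes one of three failing tests.
def fake_turn(turn: int) -> str:
    return f"{3 - turn} tests failing"


def fake_checker(prompt: str) -> str:
    return "yes" if "\n0 tests failing\n" in prompt else "no"


turns_used = run_until_goal(fake_turn, fake_checker, goal="all tests pass")
capped = run_until_goal(fake_turn, fake_checker, goal="all tests pass", max_turns=2)
```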
For large-scale reliability, the authors describe the Stripe Minions pipeline. This architecture interleaves deterministic gates with probabilistic LLM steps. As depicted in the pipeline diagram, the process begins with a human trigger, followed by a deterministic orchestrator assembling context.
The LLM agent writes code, but a hard-coded lint gate runs immediately afterward, and the agent cannot skip it. If the lint fails, the agent fixes the code. Finally, a hard-coded step commits the code, followed by human review. In this structure, reliability comes from the quality of the constraints rather than from model size alone.
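The interleaving described above can be sketched as an orchestrator that owns the control flow and calls the agent only at fixed points. This is an illustration under stated assumptions, not Stripe's implementation: `run_pipeline`, the step names in the log, the retry cap, and the trailing-whitespace lint rule are all hypothetical.

```python
from typing import Callable, List


def run_pipeline(
    task: str,
    assemble_context: Callable[[str], str],
    agent_write: Callable[[str, List[str]], str],
    lint: Callable[[str], List[str]],
    commit: Callable[[str], None],
    max_fix_rounds: int = 2,
) -> List[str]:
    """Return the ordered list of steps that ran; the agent cannot reorder or skip gates."""
    log = ["context"]
    context = assemble_context(task)            # deterministic
    code = agent_write(context, [])             # probabilistic
    log.append("agent")
    for round_index in range(max_fix_rounds + 1):
        errors = lint(code)                     # deterministic gate, always runs
        log.append("lint")
        if not errors:
            commit(code)                        # deterministic
            log.extend(["commit", "human_review"])
            return log
        if round_index < max_fix_rounds:
            code = agent_write(context, errors) # probabilistic fix, bounded retries
            log.append("agent_fix")
    log.append("escalate")                      # gate never passed: hand to a human
    return log


# Demo stubs (assumptions): the "agent" leaves trailing whitespace on its first draft.
def stub_lint(code: str) -> List[str]:
    return [
        f"line {number}: trailing whitespace"
        for number, line in enumerate(code.splitlines(), start=1)
        if line != line.rstrip()
    ]


def stub_agent(context: str, errors: List[str]) -> str:
    return "x = 1\n" if errors else "x = 1 \n"


committed: List[str] = []
steps = run_pipeline(
    "set x",
    assemble_context=lambda task: f"task: {task}",
    agent_write=stub_agent,
    lint=stub_lint,
    commit=committed.append,
)
```

Because the gate lives in the orchestrator rather than in the prompt, the only way code reaches `commit` is through a passing `lint` call.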
Experiment
The evaluation contrasts two ways of running background work while the user sleeps: local scheduling (local loop and desktop scheduled tasks) and cloud scheduling (cloud routines and GitHub Actions schedule triggers). Local scheduling requires the machine to stay powered on but allows frequent execution and direct access to local files. Cloud scheduling runs independently of local state, at the cost of a one-hour minimum interval and a clean clone on every run. The comparison shows that no single scheduler meets every requirement. It also warns that widely circulated secondhand metrics should be treated as rough references, since firsthand sources are more reliable.
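The trade-off can be written down as a small decision helper. This sketch encodes only the constraints stated above (a one-hour minimum interval and a clean clone for cloud runs, a powered-on machine for local runs). The function name, its parameters, and the choice to prefer cloud when both options fit are assumptions made for illustration.

```python
CLOUD_MIN_INTERVAL_MINUTES = 60  # one-hour minimum interval stated for cloud scheduling


def choose_scheduler(
    interval_minutes: int,
    needs_local_files: bool,
    machine_always_on: bool,
) -> str:
    """Return "cloud", "local", or "neither" for a given background job."""
    # Cloud runs start from a clean clone, so they cannot see local-only files.
    cloud_ok = interval_minutes >= CLOUD_MIN_INTERVAL_MINUTES and not needs_local_files
    # Local runs only fire while the machine is powered on.
    local_ok = machine_always_on
    if cloud_ok:
        return "cloud"      # assumption: prefer the option that frees the machine
    if local_ok:
        return "local"
    return "neither"        # no single scheduler covers this combination
```

For example, a job that must run every 15 minutes on a laptop that sleeps overnight fits neither option, which mirrors the finding that no single scheduler meets every requirement.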