Loop Engineering: The Anthropic Guide to Designing Systems That Instruct Your Agents
Peter Steinberger Boris Cherny Addy Osmani
Abstract
Over the past two years, a series of terms of the form "[label] engineering" has tracked the pace of model releases. This note examines the newest of them, loop engineering, a term introduced into the discussion independently by Peter Steinberger, Boris Cherny, and Addy Osmani in June 2026 and coined in writing by Osmani. Unlike prompt engineering, context engineering, or harness engineering, loop engineering does not teach practitioners to do their work better; instead, it removes them entirely from the process of performing the work. We define the term, place it as a fourth layer above the harness, and break a single turn of a loop into five moves (discovery, handoff, verification, persistence, and scheduling) and the six parts that implement those moves. We pay particular attention to the separation of generator and evaluator: empirical findings show that an agent asked to grade its own output tends to rate it favorably. Tuning an independent, skeptical evaluator is far more tractable than getting a generator to be critical of its own work. We present three loops running in practice, ranging from a single engineer's daily triage to Stripe's company-wide pipeline, which merges more than 1,300 machine-written pull requests per week. We also catalog four costs that accumulate silently: verification debt, comprehension rot, cognitive surrender, and token blowout. We close with a concrete recipe for building a first loop.
The central claim is that loops make generation nearly free and turn judgment into the scarce resource; the same loop can produce opposite outcomes depending on the people involved.
One-sentence Summary
In this note, Peter Steinberger, Boris Cherny, and Addy Osmani introduce Loop Engineering as a fourth layer above harness engineering that removes practitioners from performing work by designing self-prompting agent loops, decomposing each turn into discovery, handoff, verification, persistence, and scheduling, crucially separating generator from evaluator because agents grading their own output tend to self-praise, and surveying real-world loops from a personal morning triage to Stripe’s pipeline merging over 1,300 machine-written pull requests per week, demonstrating that loops make generation nearly free while judgment becomes the scarce resource and the same loop can produce opposite outcomes in different hands.
Key Contributions
- The note defines loop engineering as a fourth layer above harness engineering, decomposing a single loop turn into five moves (discovery, handoff, verification, persistence, scheduling) and six constituent parts.
- It introduces a generator/evaluator separation, empirically showing that agents overpraise their own outputs and that an independently tuned skeptical evaluator is far more tractable than making a generator self-critical.
- The note surveys three real-world loops, catalogs four hidden costs (verification debt, comprehension rot, cognitive surrender, token blowout), provides a concrete build recipe, and establishes that loops make generation nearly free, concentrating engineering value into judgment as the scarce resource.
Introduction
The authors examine a new paradigm called Loop Engineering, which shifts the practitioner from directly prompting AI coding agents to designing autonomous systems that prompt themselves. This matters because earlier approaches—prompt, context, and harness engineering—all kept a human in the loop, limiting scalability and requiring constant attention. The key limitation of prior work is that the human must act as the clock and decision-maker, unable to step away. The authors’ main contribution is a formal definition of loop engineering, a decomposition of a loop’s turn into five moves (discovery, handoff, verification, persistence, and scheduling), and an emphasis on the generator/evaluator split to maintain judgment while automating generation.
Method
The authors propose a hierarchical framework for engineering AI agents, culminating in a self-running loop architecture. This framework stacks four distinct layers, each expanding the scope of concern. As shown in the figure below, the stack progresses from Prompt Engineering at the base, through Context and Harness Engineering, to Loop Engineering at the top.
Prompt Engineering manages the wording for a single exchange. Context Engineering curates the model's field of view. Harness Engineering equips a single run with tools and actions. Loop Engineering automates the entire process, allowing the system to wake on a schedule, spawn sub-agents, and feed its own output back as input for subsequent rounds.
A functional loop executes a concrete cycle of five moves rather than spinning idly. As illustrated in the diagram below, these moves form a continuous turn that feeds the next iteration.
First, Discovery identifies work worth doing, such as reading CI failures, allowing the agent to find its own tasks. Second, Handoff moves the task to an isolated environment, like a git worktree, to prevent collisions during parallel execution. Third, Verification checks the result, serving as the critical mechanism to reject poor output. Fourth, Persistence saves state to disk so the loop survives context window clearing. Finally, Scheduling triggers the next turn automatically.
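The five moves can be sketched as a single function that is called once per turn. This is a minimal illustration, not the authors' implementation: `run_turn`, `run_loop`, the callables `discover`, `handoff`, and `verify`, and the JSON state layout are all hypothetical names chosen for the example, and the `for` loop merely stands in for a real scheduler.

```python
import json
from pathlib import Path
from typing import Callable, Optional


def run_turn(
    state_file: Path,
    discover: Callable[[], Optional[str]],
    handoff: Callable[[str], str],
    verify: Callable[[str], bool],
) -> dict:
    """Run one loop turn and return the state that was written to disk."""
    # Reload state from disk instead of relying on the context window.
    if state_file.exists():
        state = json.loads(state_file.read_text())
    else:
        state = {"turn": 0, "accepted": [], "rejected": []}

    task = discover()                      # 1. Discovery: find work worth doing
    if task is not None:
        result = handoff(task)             # 2. Handoff: do the work in isolation
        bucket = "accepted" if verify(result) else "rejected"  # 3. Verification
        state[bucket].append(task)

    state["turn"] += 1
    state_file.write_text(json.dumps(state))  # 4. Persistence: survive a cleared context
    return state


def run_loop(state_file, discover, handoff, verify, max_turns: int) -> dict:
    """5. Scheduling stand-in: a real loop would be woken by a timer or an event."""
    state: dict = {}
    for _ in range(max_turns):
        state = run_turn(state_file, discover, handoff, verify)
    return state
```

Because every turn reloads its state from `state_file`, the turn itself holds nothing in memory, which is what lets an external scheduler replace the `for` loop without changing `run_turn`.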
To enable these moves, the architecture relies on six structural parts. Automations trigger the loop based on time or events. Worktrees provide isolation for parallel agents. Skills store permanent project knowledge to reduce intent debt. Connectors link the loop to external tools via protocols like MCP. Sub-agents split the writer from the judge. Memory ensures state persists across days outside the conversation window.
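As one concrete illustration of the isolation that worktrees provide, the sketch below builds a `git worktree add` command that gives each task its own branch and directory. The function name `worktree_command`, the `loop/<task_id>` branch naming, and the sibling-directory layout are assumptions made for this example; the function only constructs the command and does not run it.

```python
from pathlib import Path


def worktree_command(repo_root: Path, task_id: str) -> list[str]:
    """Build a git command that creates one isolated worktree per task."""
    branch = f"loop/{task_id}"  # hypothetical naming scheme
    # Place the worktree next to the repository, not inside it.
    path = repo_root.parent / f"{repo_root.name}-wt-{task_id}"
    return ["git", "-C", str(repo_root), "worktree", "add", "-b", branch, str(path)]
```

Two agents working on tasks `41` and `42` would get different branches and different directories, so neither can overwrite the other's uncommitted files.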
The most critical architectural decision involves the verification module. The authors note that agents tend to praise their own work, which produces a "nodding loop" in which errors accumulate unchallenged. To solve this, the framework applies a Maker-Checker principle. As shown in the figure below, the architecture structurally splits the agent into a Generator and an Evaluator.
The Generator writes the code. The Evaluator, often a different model instructed to assume the code is broken, reviews it. Crucially, the Evaluator acts by running tests or inspecting the DOM rather than just reading code.
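The split can be sketched as two separate functions, where the checker executes the candidate against test cases instead of reading it. This is a toy illustration under stated assumptions: `evaluate`, `generate_until_accepted`, and the `generate` callable are hypothetical, a real Generator and Evaluator would be model calls, and `exec` on untrusted code would need a sandbox in practice.

```python
from typing import Callable, Optional


def evaluate(source: str, cases: list, func_name: str) -> tuple[bool, str]:
    """Checker: assume the code is broken; accept only if every case passes."""
    namespace: dict = {}
    try:
        exec(source, namespace)  # run the candidate, do not just read it
        func = namespace[func_name]
        for args, expected in cases:
            actual = func(*args)
            if actual != expected:
                return False, f"{func_name}{args} returned {actual!r}, expected {expected!r}"
    except Exception as exc:
        return False, f"raised {type(exc).__name__}: {exc}"
    return True, "all cases passed"


def generate_until_accepted(
    generate: Callable[[str], str],
    cases: list,
    func_name: str,
    max_attempts: int = 3,
) -> tuple[Optional[str], int]:
    """Maker: produce drafts, feeding the checker's objection into the next draft."""
    feedback = ""
    for attempt in range(1, max_attempts + 1):
        source = generate(feedback)
        accepted, feedback = evaluate(source, cases, func_name)
        if accepted:
            return source, attempt
    return None, max_attempts
```

The maker never decides whether its own draft is good; only `evaluate` can accept it, and its verdict depends on observed behavior.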
The stop condition is managed by a fresh model that checks whether a specific goal has been met. A small, fast model runs this check after each turn.
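A minimal sketch of such a stop check follows. The original snippet is not reproduced here, so everything in this block is an assumption: `ask_model` is a hypothetical callable standing in for a call to a small, fast model, and the YES/NO prompt format is invented for the example. The checker sees only the goal and the latest evidence, not the working conversation, which is what makes it "fresh".

```python
from typing import Callable, Optional


def goal_met(ask_model: Callable[[str], str], goal: str, evidence: str) -> bool:
    """Ask a fresh model whether the goal is met, given only goal and evidence."""
    prompt = (
        f"Goal: {goal}\n"
        f"Evidence: {evidence}\n"
        "Answer with exactly YES or NO: has the goal been met?"
    )
    # Anything other than a clear YES counts as "not met", so the loop keeps going.
    return ask_model(prompt).strip().upper() == "YES"


def run_until_goal(
    step: Callable[[int], str],
    ask_model: Callable[[str], str],
    goal: str,
    max_turns: int,
) -> Optional[int]:
    """Run turns until the checker says the goal is met; return that turn number."""
    for turn in range(1, max_turns + 1):
        evidence = step(turn)  # one turn of work, returning observable evidence
        if goal_met(ask_model, goal, evidence):
            return turn
    return None  # turn budget exhausted without meeting the goal
```

The `max_turns` budget is an added safeguard in this sketch, so that a checker that never answers YES cannot keep the loop running indefinitely.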
For large-scale reliability, the authors describe the Stripe Minions pipeline. This architecture interleaves deterministic gates with probabilistic LLM steps. As depicted in the pipeline diagram, the process begins with a human trigger, followed by a deterministic orchestrator assembling context.
The LLM agent writes code, but a hard-coded lint gate runs immediately afterward, and the agent cannot skip it. If lint fails, the agent fixes the errors. Finally, a hard-coded step commits the code, followed by human review. This structure ensures reliability comes from the quality of constraints rather than just model size.
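The interleaving of deterministic gates and probabilistic steps can be sketched as a function in which the model-backed callables never control the flow. This is an illustration of the pattern only, not Stripe's code: `run_pipeline`, every parameter name, and the `max_fix_rounds` cap are assumptions introduced for the example.

```python
from typing import Callable, Optional


def run_pipeline(
    task: str,
    assemble_context: Callable[[str], str],      # deterministic orchestrator
    agent_write: Callable[[str], str],           # probabilistic LLM step
    lint: Callable[[str], list],                 # deterministic gate
    agent_fix: Callable[[str, list], str],       # probabilistic LLM step
    commit: Callable[[str], str],                # deterministic commit step
    max_fix_rounds: int = 2,                     # assumed cap, not from the source
) -> Optional[str]:
    """Return the commit result, or None if the gate never passes."""
    context = assemble_context(task)
    code = agent_write(context)

    # The gate is ordinary control flow, so the agent has no way to skip it.
    errors = lint(code)
    rounds = 0
    while errors and rounds < max_fix_rounds:
        code = agent_fix(code, errors)
        errors = lint(code)  # every fix is re-checked by the same gate
        rounds += 1

    if errors:
        return None  # nothing is committed while the gate is failing
    return commit(code)  # human review would follow this step
```

The agent only ever returns text; whether that text is committed is decided by code that does not consult the model.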
Experiment
The evaluation contrasts local loop/desktop scheduled tasks with cloud routines and GitHub Actions schedule triggers for running background work while the user sleeps. Local scheduling demands that the machine remain powered on but enables frequent execution and direct access to local files, whereas cloud scheduling runs untethered from local state at the cost of a one-hour minimum interval and a clean clone each time. The comparison shows that no single scheduler meets all requirements, and it warns that widely circulated secondhand metrics should be treated as rough references, highlighting the greater reliability of firsthand sources.
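The trade-off above can be written down as a small decision helper. This is a rough sketch that encodes only the constraints stated in the comparison (local needs the machine powered on; cloud has a one-hour minimum interval and starts from a clean clone); `viable_schedulers` and its parameter names are hypothetical.

```python
CLOUD_MIN_INTERVAL_MINUTES = 60  # the one-hour minimum described above


def viable_schedulers(
    interval_minutes: int,
    needs_local_files: bool,
    machine_stays_on: bool,
) -> list[str]:
    """List which scheduler kinds can satisfy the stated requirements."""
    options = []
    # Local scheduling runs often and sees local files, but only while powered on.
    if machine_stays_on:
        options.append("local")
    # Cloud scheduling starts from a clean clone and cannot run more than hourly.
    if interval_minutes >= CLOUD_MIN_INTERVAL_MINUTES and not needs_local_files:
        options.append("cloud")
    return options
```

A 15-minute job on a laptop that is switched off overnight yields an empty list, which matches the observation that no single scheduler meets all requirements.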