Agent = Model + Harness

New research challenges the prevailing focus on foundation models in AI development by arguing that system architecture, or the "harness," is equally critical to agent performance. The study summarizes its framework in the equation Agent = Model + Harness: an agent's success depends as much on the software layer that manages context, tools, and state as on the underlying language model.

To test this hypothesis, the researchers built Harness-Bench, an evaluation covering 106 sandboxed tasks, eight model backends, and six harness configurations, which together produced over 5,000 execution trajectories. Swapping the harness while holding the model and task set fixed shifted scores by up to 23.8 points. For instance, NanoBot scored 76.2 while OpenClaw scored 52.4 on the same model backend. This variance indicates that the system layer can determine outcomes as much as the model's capability does.

Analysis of failure modes produced a surprising result: agents rarely fail for lack of reasoning ability. Most errors stem from execution alignment issues instead. Contract violations and malformed outputs accounted for 36.4% of failures, followed by tool errors in which the system failed to recover from a blockage. Models frequently produced correct reasoning but did not commit it to the environment as a verifiable artifact. The bottleneck, in other words, is not cognitive ability but the harness's ability to translate reasoning into concrete actions and keep accurate books.

The study defines execution alignment as the degree to which a harness preserves the correspondence between intention and verified completion. When this alignment holds, plausible reasoning becomes real work; when it breaks, the agent drifts.

The study also notes a complex relationship between model strength and harness dependency. Weak models rely heavily on a robust harness, and their performance swings widely with the system layer. Stronger models are more resilient to harness variation: they internalize much of what the harness would otherwise provide, which narrows the relative gap between configurations. The value of a sophisticated harness therefore decays as the underlying model improves, though it remains vital for weaker models.

Efficiency also shaped the benchmark outcomes. NanoBot, an ultra-lightweight agent with a minimal core loop, outperformed heavier stacks such as Hermes, which consumed significantly more tokens and took longer to complete tasks despite offering more features. The highest score, 80.4, went to Codex, a specialized model-bound agent, indicating that tight, specialized execution loops often beat broad, flexible architectures. The six evaluated harnesses ranged from the lightweight NanoBot to feature-rich platforms like OpenClaw and secure runtimes like Moltis, and the most effective configurations were those that prioritized simplicity and fidelity over complexity.

Ultimately, the findings urge developers to shift from optimizing models alone to engineering harnesses that ensure execution alignment. The critical question is no longer how many tools a system exposes, but whether it can faithfully tether intention to verifiable completion. A harness that merely bridges the gap for a weak model may not be enough if that model eventually outgrows the scaffold. The most successful agents appear to combine a small, legible loop with rigorous state management, suggesting that in the race for AI performance, bookkeeping is just as important as brainpower.
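
To make the idea of execution alignment concrete, here is a minimal Python sketch of a small harness loop that rejects malformed outputs and accepts a completion claim only after checking the environment. It is an illustration under stated assumptions, not the implementation of any harness in the study: the names `Task`, `RunLog`, `parse_action`, and `run_loop`, the JSON action format, and the in-memory `env` dictionary standing in for a sandbox are all hypothetical.

```python
import json
from dataclasses import dataclass
from typing import Callable, Optional

# Hypothetical illustration: none of these names come from Harness-Bench
# or from any of the harnesses it evaluates.


@dataclass
class Task:
    prompt: str
    # The verifier inspects the environment, never the model's own claims.
    verify: Callable[[dict], bool]


@dataclass
class RunLog:
    steps: int = 0
    contract_violations: int = 0
    completed: bool = False


def parse_action(raw: str) -> Optional[dict]:
    """Return a validated action, or None if the output breaks the contract."""
    try:
        action = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if not isinstance(action, dict):
        return None
    if action.get("type") == "done":
        return action
    is_write = (
        action.get("type") == "write"
        and isinstance(action.get("path"), str)
        and isinstance(action.get("content"), str)
    )
    return action if is_write else None


def run_loop(model: Callable[[str], str], task: Task, max_steps: int = 8):
    env: dict = {}  # stand-in for a sandboxed file system
    log = RunLog()
    feedback = task.prompt
    for _ in range(max_steps):
        log.steps += 1
        action = parse_action(model(feedback))
        if action is None:
            # Malformed output: record it and ask again instead of guessing.
            log.contract_violations += 1
            feedback = "Invalid output. Reply with a single JSON action."
        elif action["type"] == "write":
            env[action["path"]] = action["content"]
            feedback = f"Wrote {action['path']}."
        elif task.verify(env):
            # "done" is accepted only when the artifact really exists.
            log.completed = True
            break
        else:
            feedback = "Completion claimed, but the artifact is not verified."
    return env, log


# A scripted stand-in for a model: it reasons correctly, then claims
# completion too early, and only afterwards commits the artifact.
replies = iter([
    "I should write 42 to out.txt.",
    '{"type": "done"}',
    '{"type": "write", "path": "out.txt", "content": "42"}',
    '{"type": "done"}',
])
task = Task("Write 42 to out.txt", lambda e: e.get("out.txt") == "42")
env, log = run_loop(lambda _feedback: next(replies), task)
print(log.completed, log.steps, log.contract_violations)  # True 4 1
```

In this toy run the first reply is correct reasoning that never becomes an action, and the second is a completion claim with no artifact behind it. The loop counts the first as a contract violation and refuses the second, so the task is marked complete only once `out.txt` actually holds the expected content.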
