HyperAIHyperAI

Command Palette

Search for a command to run...

ハネス更新はハネス利益ではない:自己進化型LLMエージェントにおける進化能力の解離

概要

LLMエージェントは、モデルのパラメータを変更せずにタスク実行を制御する、プロンプト、スキル、メモリ、ツールなどから構成される編集可能な外部ハーネスを中心に構築されたシステムとして、ますます広く導入されています。ハーネス自己進化(Harness Self-Evolution)は、実行記録(execution evidence)に基づいてこれらのハーネスを更新することで、そのようなエージェントを適応・強化します。しかし、モデルのタスク解決における基礎能力(base capability)が、そのハーネス自己進化能力を予測できるかどうか、すなわち、どのモデルが有用なハーネス更新を生成し、どのモデルがそれらの更新から実際に恩恵を受けるかは、まだ明確になっていません。本研究では、二つのハーネス自己進化能力を分析します。(i) ハーネス更新能力(harness-updating):実行記録に基づいて、有用で持続的なハーネス更新を生成する能力。(ii) ハーネス効果享受能力(harness-benefit):タスク解決において更新されたハーネスから利益を得る能力。私たちの分析から、二つの知見が得られました。第一に、ハーネス更新能力は基礎能力に対して平坦(flat)です:異なる能力階層のモデルは、驚くほど類似した利益をもたらすハーネス更新を生成します。例えば、Qwen3.5-9Bの更新がもたらす利益は、Claude Opus 4.6のそれと同等です。第二に、ハーネス効果享受能力は基礎能力に対して単調ではありません:弱層(weak-tier)のモデルは更新ハーネスからほとんど恩恵を受けず、中層(mid-tier)のモデルが最も恩恵を受け、強層(strong-tier)のモデルは中層ほど恩恵を受けません。弱層での低い利益は、二つの失敗モードに起因します:弱層モデルは関連するハーネスアーティファクトを激活できない、あるいは激活してもそれらを忠実に遵守できないことがあることが明らかになりました。これらの知見は、エージェントのトレーニングにおいて、進化機構(evolver)ではなくタスク解決エージェント側に能力予算を投資し、エージェントのトレーニングではハーネス呼び出し(harness invocation)と長期にわたる指示従順性(long-horizon instruction following)を重点的に目標とすべきであることを示唆しています。本調査のソースコードは、以下の公開リンクから入手可能です:here

One-sentence Summary

This analysis disentangles harness-updating from harness-benefit in self-evolving LLM agents to reveal that harness-updating yields similar gains across tiers, with Qwen3.5-9B updates yielding gains comparable to those of Claude Opus 4.6, whereas harness-benefit is non-monotonic with mid-tier models benefiting most, suggesting capability budgets should prioritize task-solving agents over evolvers.

Key Contributions

  • This work introduces a controlled capability analysis framework that separates harness-updating from harness-benefit to measure these capabilities separately. The methodology varies agents and evolvers independently to isolate specific capabilities rather than conflating end-to-end gains.
  • Experiments demonstrate that harness-updating capability remains flat across base model tiers, producing similar gains regardless of underlying model size. Results show that updates from smaller models like Qwen3.5-9B yield performance improvements comparable to those from stronger models like Claude Opus 4.6.
  • The analysis reveals that harness-benefit capability follows a non-monotonic pattern where mid-tier models benefit most while weak-tier models fail to activate or follow harness artifacts faithfully. These findings suggest prioritizing the task-solving agent over the evolver and targeting harness invocation and long-horizon instruction following in agent training.

Introduction

Large language model agents increasingly rely on editable external harnesses such as prompts, tools, and memory to shape behavior without updating model weights. While harness self-evolution allows these systems to adapt by refining artifacts based on execution evidence, prior evaluations often conflate the agent's base performance with the quality of updates and the agent's ability to utilize them. The authors disentangle these factors by distinguishing between harness-updating, the ability to produce useful artifacts, and harness-benefit, the ability to leverage them during task solving. Their analysis reveals that updating capability remains consistent across model tiers while benefit capability peaks at mid-tier levels, suggesting developers should prioritize investing in the task-solving agent rather than the evolver.

Dataset

  • Dataset Composition: The authors evaluate on three benchmarks: SWE-bench Verified for software engineering, MCP-Atlas for tool use over real servers, and SkillsBench for skill-based execution.
  • Subset Details: SWE-bench Verified contains 500 human-validated tasks from 12 Python repositories where patches must pass hidden tests. MCP-Atlas includes 500 tasks requiring orchestration across 36 servers using a claims-based scoring rubric. SkillsBench offers 86 tasks across 11 domains with deterministic verifiers.
  • Evaluation Protocol: The team employs an in-situ setting where tasks are scored under the previous harness before evidence updates the current one. Pass rate is the primary metric aggregated over the task stream.
  • Processing and Constraints: Harness editing is benchmark-specific. SWE-bench and SkillsBench restrict edits to the skills directory while MCP-Atlas allows changes to prompts and memory files. All models use fixed prompt templates and evolution budgets to maintain fair comparison.

Method

The authors leverage a harness self-evolution protocol designed to adapt an LLM agent by updating the external harness surrounding a fixed model during task execution. In this framework, the agent attempts a stream of tasks, and the harness is iteratively updated based on the agent's execution evidence. This approach distinguishes between the parametric model backbone and the non-parametric context, allowing for continuous improvement without retraining the underlying LLM.

Refer to the framework diagram to visualize the overall architecture. The system consists of a Frozen LLM connected to a dynamic Harness containing Memory, Tools, Prompts, and Skills. The process operates as a closed loop: execution experiences are gathered and subjected to Diagnosis, which feeds into an Evolver Model. The Evolver Model then generates updates that modify the Harness components, such as injecting new skills or refining prompts, to better handle future tasks.

Formally, at evolution step ttt, the agent is defined as At=(f,Ht)A_t = (f, H_t)At=(f,Ht), where fff represents the fixed model backbone and HtH_tHt denotes the harness state. The system adheres to a protocol where fff remains constant while editable components of HtH_tHt (e.g., prompts, skills, memories) are updated. The evolver eee functions as the update procedure that converts execution evidence into harness modifications. Given the previous harness Ht1H_{t-1}Ht1 and accumulated execution evidence Dt\mathcal{D}_tDt, the evolver proposes an update ΔHt\Delta H_tΔHt and applies it to obtain the next harness state:

ΔHt=e(Ht1,Dt),Ht=Apply(Ht1,ΔHt).\begin{array} { r } { \Delta H _ { t } = e ( H _ { t - 1 } , \mathcal { D } _ { t } ) , } \\ { H _ { t } = \mathrm { A p p l y } ( H _ { t - 1 } , \Delta H _ { t } ) . } \end{array}ΔHt=e(Ht1,Dt),Ht=Apply(Ht1,ΔHt).

The evolution protocol follows an iterative loop over TTT steps. At each step, the agent At1A_{t-1}At1 attempts to solve a batch of tasks Xt\mathcal{X}_tXt, outputting execution trajectories τt,x\tau_{t,x}τt,x and final outputs yt,xy_{t,x}yt,x. The evidence Dt\mathcal{D}_tDt is collected from these interactions, and the evolver produces the updated harness HtH_tHt for the subsequent step.

As shown in the figure below, the practical application of this method involves the evolver diagnosing failures and injecting specific procedural skills into the harness. The comparison illustrates a scenario where an agent without an evolver fails to complete a Flink job task due to missing steps. In contrast, agents utilizing a Qwen3.5-9B or Opus4.6 Evolver successfully pass the task. This success is attributed to the evolver auto-injecting a flink-query skill into the loaded skills list. The figure highlights that while different evolver models may produce procedurally isomorphic recipes with similar logic, the specific details and phrasing of the injected skills differ, yet both lead to a successful outcome where the base agent would have failed.

Experiment

This study evaluates harness self-evolution across seven LLMs and three agentic benchmarks by decoupling harness-updating, where models generate improvements from execution evidence, from harness-benefit, where models utilize those updates during task solving. Experiments reveal that harness-updating capability is independent of base model strength, while harness-benefit follows a non-monotonic pattern where mid-tier models gain the most and weaker models struggle due to failures in activating artifacts or adhering to guidance. These findings suggest that system design should prioritize investing capability in the task-solving agent and training it to reliably invoke and follow external harness instructions.

The authors evaluate harness-updating capability by pairing various evolver models with fixed task-solving agents across three benchmarks. Results indicate that the ability to generate useful harness updates is relatively consistent across different model tiers, with no single evolver dominating all tasks. Furthermore, the final system performance is driven more by the base capability of the task-solving agent than by the specific evolver model used. Harness-updating gains remain stable across evolver capability tiers, showing that smaller models can produce updates comparable to larger ones. No evolver model consistently outperforms others across all benchmarks, indicating a reshuffling of effectiveness depending on the task domain. The base capability of the task-solving agent is the dominant factor in post-evolution performance, while the evolver identity contributes less variance.

The authors analyze per-phase adherence scores to diagnose why weak-tier models derive low benefit from harness updates. The results show that while strong models maintain stable adherence throughout task execution, weaker models suffer from significant drift as the trajectory unfolds. This indicates a long-horizon instruction-following bottleneck where adherence decays much more steeply for weaker models compared to their stronger counterparts. Strong models maintain the highest adherence scores across all trajectory phases. Weak models exhibit the largest negative drift in adherence from loading to the final turn. Mid-tier models show moderate adherence decay, falling between weak and strong model performance.

The authors analyze agent-side capabilities in harness self-evolution by measuring activation, adherence, and success rates across models of varying capabilities. The data reveals that weaker models suffer from both low skill-load rates and poor adherence to loaded instructions, whereas stronger models consistently activate and follow harness artifacts. Notably, mid-capability models show high activation rates but struggle with adherence, indicating that loading a skill does not guarantee effective utilization. Weak-tier models exhibit significantly lower skill-load rates compared to strong-tier models, failing to activate harness artifacts in the majority of trajectories. Strong-tier models demonstrate superior adherence to loaded skills, maintaining high following rates while weaker models often deviate from instructions. A disparity exists between activation and adherence for mid-tier models, where high skill-load rates do not necessarily translate to high instruction-following performance.

The authors analyze harness-benefit capability across seven LLMs and three benchmarks to determine which models benefit most from updated agent harnesses. Results indicate that the improvement from harness evolution is non-monotonic with respect to base model capability, where mid-tier models show the largest gains while both weaker and stronger models benefit less. This pattern suggests that weak models struggle to activate or adhere to harness instructions, whereas strong models face diminishing returns due to performance ceilings. Mid-tier models demonstrate the highest improvement from harness evolution compared to weaker or stronger counterparts. Stronger models experience limited gains as they approach performance ceilings on the benchmarks. Weaker models fail to leverage harness updates effectively due to difficulties in activating and following procedural guidance.

The authors analyze extreme pairings where the strongest task-solving agent uses its worst evolver and the weakest agent uses its best evolver. Results indicate that the strong agent consistently outperforms the weak agent by a significant margin across all benchmarks, demonstrating that the agent's base capability is the primary determinant of post-evolution performance. Strong agents maintain a substantial performance lead over weak agents even when paired with their least effective evolvers. The performance gap favors the strong agent across all tested benchmarks, highlighting the dominance of agent capability over evolver quality. Results suggest that optimizing the task-solving agent yields greater returns than optimizing the model responsible for generating harness updates.

The authors evaluate harness-updating capabilities by pairing various evolver models with fixed task-solving agents across multiple benchmarks. Findings reveal that the base capability of the task-solving agent is the primary determinant of performance, outweighing the specific evolver model used. Although mid-tier models benefit most from harness evolution, weaker models struggle with instruction adherence while stronger models face diminishing returns, indicating that optimizing the agent yields greater returns than refining the evolver.


AIでAIを構築

アイデアからローンチまで — 無料のAIコーディング支援、すぐに使える環境、最高のGPU価格でAI開発を加速。

AI コーディング補助
すぐに使える GPU
最適な料金体系

HyperAI Newsletters

最新情報を購読する
北京時間 毎週月曜日の午前9時 に、その週の最新情報をメールでお届けします
メール配信サービスは MailChimp によって提供されています