Harness-1: Reinforcement Learning for Search Agents with State-Externalizing Harnesses

Pengcheng Jiang Zhiyi Shi Kelly Hong Xueqiang Xu Jiashuo Sun Jimeng Sun Hammad Bashir Jiawei Han

Abstract

Search agents are often trained as policies operating over growing transcripts: the model must decide on a search strategy while also remembering what it has observed, identifying useful evidence, tracking constraints that remain open, and keeping track of claims it has already verified. We argue that this formulation burdens the policy with too much routine state management: reinforcement learning (RL) is forced to optimize both semantic search decisions and recoverable bookkeeping that the environment could maintain more reliably. We present Harness-1, a 20-billion-parameter search agent (a retrieval subagent) trained with RL inside a stateful search harness. The harness maintains environment-side working memory, including a candidate pool, an importance-tagged curated set, compact evidence links, verification records, compressed and deduplicated observations, and budget-aware context rendering. The policy retains the semantic decisions: what to search for, which documents to keep or discard, what to verify, and when to stop. Across eight retrieval benchmarks spanning web, finance, patents, and multi-hop question answering (multi-hop QA), Harness-1 achieves an average curated recall of 0.730, outperforming the next strongest open-source search subagent by 11.4 points while remaining competitive with much larger frontier-model searchers.

One-sentence Summary

Harness-1 is a 20B search agent trained with reinforcement learning inside a stateful search harness that externalizes routine state management to the environment while the policy retains semantic decisions, achieving 0.730 average curated recall across eight retrieval benchmarks spanning web, finance, patents, and multi-hop QA and outperforming the next strongest open search subagent by 11.4 points.

Key Contributions

  • The paper introduces Harness-1, a 20B search agent trained with reinforcement learning inside a stateful search harness. This formulation moves routine state management out of the policy and into the environment for more reliable bookkeeping.
  • The harness maintains environment-side working memory including a candidate pool, an importance-tagged curated set, compact evidence links, and verification records. The policy retains semantic decisions such as what to search, which documents to keep, and when to stop.
  • Harness-1 achieves 0.730 average curated recall across eight retrieval benchmarks spanning web, finance, patents, and multi-hop QA. The method outperforms the next strongest open search subagent by 11.4 points and remains competitive with much larger frontier-model searchers.

Introduction

Search agents typically operate as policies over growing transcripts where they must manage memory and evidence while deciding how to search. This formulation places excessive routine state management inside the policy, forcing reinforcement learning to optimize both semantic search decisions and recoverable bookkeeping that the environment could maintain more reliably. The authors introduce Harness-1, a 20B search agent trained with reinforcement learning inside a stateful search harness. This harness maintains environment-side working memory including candidate pools and verification records, allowing the policy to retain only semantic decisions like what to search or verify. This separation enables Harness-1 to outperform existing open search subagents by 11.4 points on average across eight retrieval benchmarks while remaining competitive with larger frontier models.

Dataset

  • Dataset Composition and Sources: The authors generate supervised fine-tuning trajectories using GPT-5.4 running natively inside the Harness-1 harness. The data spans encyclopedic web search, finance filings, patents, and multi-hop question answering domains.
  • Subset Details: Raw quotas target 300 BC+, 250 SEC, 150 Patents, and 150 Web samples, alongside simplified variants. A 0.10 recall gate filters out low-quality trajectories, resulting in a final corpus of 899 trajectories.
  • Training Usage: The 899 trajectories undergo per-trajectory expansion into turn-conditional datums to yield approximately 26K training examples. This data is used to train the policy on tool selection and document management.
  • Processing and Metadata: Search observations are processed with BM25 sentence compression limited to the top-4 sentences per chunk. Content deduplication applies MinHash-LSH with 64 permutations and a 0.85 threshold. An auto-populate feature adds the top-8 reranked results to the curated set after the first search, marked with an [AUTO-POPULATED] tag.
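The BM25 sentence-compression step described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `compress_chunk`, the regex sentence splitter, the tokenizer, and the BM25 constants `k1=1.5` and `b=0.75` are all assumptions; only the top-4 limit comes from the text.

```python
import math
import re
from collections import Counter


def tokenize(text):
    # Assumed tokenizer: lowercase alphanumeric runs.
    return re.findall(r"[a-z0-9]+", text.lower())


def compress_chunk(query, chunk, top_k=4, k1=1.5, b=0.75):
    """Keep the top_k sentences of `chunk` by BM25 score against `query`,
    returned in their original order. top_k=4 follows the text; k1 and b
    are conventional defaults, not values from the paper."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", chunk) if s.strip()]
    if len(sentences) <= top_k:
        return sentences

    docs = [tokenize(s) for s in sentences]
    n = len(docs)
    avg_len = sum(len(d) for d in docs) / n
    # Document frequency: number of sentences containing each term.
    df = Counter(term for d in docs for term in set(d))
    q_terms = set(tokenize(query))

    def score(doc):
        tf = Counter(doc)
        total = 0.0
        for term in q_terms:
            if term not in tf:
                continue
            idf = math.log(1 + (n - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(doc) / avg_len)
            total += idf * tf[term] * (k1 + 1) / norm
        return total

    # Rank sentence indices by score, keep top_k, then restore reading order.
    ranked = sorted(range(n), key=lambda i: score(docs[i]), reverse=True)[:top_k]
    return [sentences[i] for i in sorted(ranked)]
```

Returning the kept sentences in reading order rather than score order is a design choice made here so the compressed chunk stays legible; the source does not say which ordering the harness uses.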

Method

The authors introduce Harness-1, a retrieval agent architecture designed around the principle of stateful cognitive offloading. In this framework, the policy is relieved of the burden of maintaining search state, allowing it to focus on semantic decisions such as query formulation and evidence selection. The environment maintains a persistent working memory that evolves over the course of an episode.

Refer to the framework diagram above to see the interaction between the policy and the harness. The system operates as a state machine where the transition is defined as $(s_t, a_t) \mapsto (s_{t+1}, o_{t+1})$. The harness state $S_t$ comprises a candidate pool for uncurated documents, a curated set capped at 30 items with importance tags, an evidence graph summarizing cross-document entity links, and a verification cache. A full-text store retains all retrieved chunks for later inspection without cluttering the prompt.
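To make the state-machine view concrete, here is a minimal sketch of an environment-side state and a transition function. The class name `HarnessState`, the function `step`, the action dictionary layout, and the action types `retrieve`, `curate`, and `remove` are all hypothetical; only the state components and the cap of 30 curated items come from the text. Search results are passed in with the action as a stand-in for a real search backend.

```python
from dataclasses import dataclass, field

CURATED_CAP = 30  # curated-set cap stated in the text


@dataclass
class HarnessState:
    candidate_pool: set = field(default_factory=set)        # uncurated doc ids
    curated: dict = field(default_factory=dict)             # doc id -> importance tag
    evidence_graph: dict = field(default_factory=dict)      # entity -> set of doc ids
    verification_cache: dict = field(default_factory=dict)  # claim -> verdict
    full_text: dict = field(default_factory=dict)           # doc id -> full chunk text


def step(state, action):
    """Apply one action to the state in place and return (state, observation)."""
    kind = action["type"]
    if kind == "retrieve":
        # Hypothetical stand-in: hits arrive with the action instead of a backend call.
        for doc_id, text in action["hits"].items():
            state.full_text[doc_id] = text
            if doc_id not in state.curated:
                state.candidate_pool.add(doc_id)
        obs = f"retrieved {len(action['hits'])} docs"
    elif kind == "curate":
        doc_id = action["doc_id"]
        if doc_id not in state.curated and len(state.curated) >= CURATED_CAP:
            obs = "curated set full"  # the environment, not the policy, enforces the cap
        else:
            state.candidate_pool.discard(doc_id)
            state.curated[doc_id] = action["importance"]
            obs = f"curated {doc_id}"
    elif kind == "remove":
        state.curated.pop(action["doc_id"], None)
        obs = f"removed {action['doc_id']}"
    else:
        raise ValueError(f"unknown action type: {kind}")
    return state, obs
```

The point of the sketch is the division of labor: the cap, the pool bookkeeping, and the full-text store are handled by the environment, so the policy only has to choose which document to curate and with what importance.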

The policy $\pi_\theta$, instantiated as a 20B parameter model, observes a rendered version of this state known as the Working Memory. Based on this observation, the model emits a single structured action $a_t$ per turn. These actions are categorized into retrieval, inspection, curation, verification, and termination. Retrieval actions like fan_out_search and search_corpus bring new evidence into the candidate pool. Curation actions allow the policy to add or remove documents from the curated set and assign importance levels such as very_high, high, fair, or low. Verification actions enable the policy to write claims and check them against the full-text store.
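One way to picture the single structured action per turn is a small validated schema. In the sketch below, only the tool names fan_out_search and search_corpus, the five categories, and the four importance levels come from the text; every other tool name, the `Action` class, and `parse_action` are hypothetical placeholders.

```python
from dataclasses import dataclass, field

IMPORTANCE_LEVELS = ("very_high", "high", "fair", "low")  # from the text

# Tool -> category. Only fan_out_search and search_corpus are named in the text;
# every other tool name below is a hypothetical placeholder.
TOOL_CATEGORY = {
    "fan_out_search": "retrieval",
    "search_corpus": "retrieval",
    "read_document": "inspection",    # hypothetical
    "curate_add": "curation",         # hypothetical
    "curate_remove": "curation",      # hypothetical
    "verify_claim": "verification",   # hypothetical
    "finish": "termination",          # hypothetical
}


@dataclass
class Action:
    tool: str
    args: dict = field(default_factory=dict)

    @property
    def category(self):
        return TOOL_CATEGORY[self.tool]


def parse_action(raw):
    """Validate a raw dict emitted by the policy and return an Action."""
    tool = raw.get("tool")
    if tool not in TOOL_CATEGORY:
        raise ValueError(f"unknown tool: {tool!r}")
    args = dict(raw.get("args", {}))
    # Curation additions must carry one of the four importance levels.
    if tool == "curate_add" and args.get("importance") not in IMPORTANCE_LEVELS:
        raise ValueError(f"invalid importance: {args.get('importance')!r}")
    return Action(tool, args)
```

Rejecting malformed actions at the harness boundary keeps format errors out of the environment state, which is presumably part of what the SFT stage teaches the model to avoid.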

To ensure efficient training and inference, the harness employs derived-state rendering. Search observations are compressed using BM25 sentence selection and deduplicated by content fingerprint before reaching the prompt. The rendered observation $O_{t+1}$ includes a header with the query, the curated set grouped by importance, the candidate pool, search history, and the evidence graph. This compact representation prevents context overflow while preserving actionable information.
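A rough sketch of derived-state rendering under a budget is shown below. The section order follows the description above, but the function name `render_working_memory`, the section labels, the character-based budget (a real system would more likely count tokens), and the `[truncated]` marker are assumptions made for illustration.

```python
IMPORTANCE_ORDER = ("very_high", "high", "fair", "low")


def render_working_memory(query, curated, candidates, history, evidence,
                          budget_chars=2000):
    """Render harness state as text, keeping whole lines until the budget is spent.

    curated:    dict of doc id -> importance tag
    candidates: iterable of uncurated doc ids
    history:    list of past query strings
    evidence:   dict of entity -> set of doc ids
    """
    lines = [f"QUERY: {query}", "CURATED SET:"]
    for level in IMPORTANCE_ORDER:
        ids = sorted(d for d, lv in curated.items() if lv == level)
        if ids:
            lines.append(f"  [{level}] " + ", ".join(ids))
    lines.append("CANDIDATE POOL: " + ", ".join(sorted(candidates)))
    lines.append("SEARCH HISTORY:")
    lines += [f"  {i + 1}. {q}" for i, q in enumerate(history)]
    lines.append("EVIDENCE GRAPH:")
    lines += [f"  {e} -> {', '.join(sorted(ds))}" for e, ds in sorted(evidence.items())]

    out, used = [], 0
    for line in lines:
        cost = len(line) + 1  # +1 for the newline
        if used + cost > budget_chars:
            out.append("[truncated]")  # marker itself is not counted against the budget
            break
        out.append(line)
        used += cost
    return "\n".join(out)
```

Because the observation is rebuilt from state on every turn, its size depends on the state and the budget rather than on how many turns have elapsed, which is the property that separates this design from a growing transcript.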

The training pipeline consists of two stages. First, Supervised Fine-Tuning (SFT) is performed on trajectories generated by a teacher agent. This stage teaches the model the correct tool-call formats and the rhythm of search followed by curation. Second, Reinforcement Learning (RL) is applied using the CISPO algorithm. The RL process optimizes the policy over full search episodes using a terminal reward signal. This reward combines set-level quality metrics, trajectory coverage, answer evidence bonuses, tool diversity incentives, and turn penalties to encourage efficient and accurate search behavior.
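The composition of the terminal reward can be illustrated with a toy function. The source names the five ingredients but gives no formulas or weights, so everything concrete here is an assumption: F1 over the curated set as the set-level metric, the fraction of gold documents ever seen as trajectory coverage, a binary answer-evidence bonus, a count of distinct tools as the diversity term, and all weight values.

```python
def terminal_reward(curated, gold, seen, answer_docs, tools_used, turns,
                    w_set=1.0, w_cov=0.3, w_ans=0.2, w_div=0.05, turn_cost=0.01):
    """Toy terminal reward; all weights and component definitions are hypothetical."""
    curated, gold, seen, answer_docs = map(set, (curated, gold, seen, answer_docs))

    # Set-level quality: F1 of the curated set against the gold documents.
    hits = len(curated & gold)
    recall = hits / len(gold) if gold else 0.0
    precision = hits / len(curated) if curated else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0

    # Trajectory coverage: gold documents encountered at any point in the episode.
    coverage = len(seen & gold) / len(gold) if gold else 0.0

    # Answer-evidence bonus: all answer-bearing documents ended up curated.
    answer_bonus = 1.0 if answer_docs and answer_docs <= curated else 0.0

    # Tool diversity: number of distinct tools used.
    diversity = len(set(tools_used))

    return (w_set * f1 + w_cov * coverage + w_ans * answer_bonus
            + w_div * diversity - turn_cost * turns)
```

Since the reward arrives only at episode end, the turn penalty is the one term that pushes directly against long episodes; the relative weighting in the actual system is not stated in the source.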

Experiment

Harness-1 was evaluated across eight diverse retrieval benchmarks using recall-oriented metrics to assess evidence coverage against open and frontier baselines. The agent demonstrates stronger gains on held-out transfer tasks than on source-family benchmarks, validating that it learns general search operations rather than dataset-specific patterns. Ablation studies reveal that disabling harness mechanisms causes the policy to revert to shallow search modes, while modular RAG tests show these curated sets improve downstream answer accuracy.

The authors introduce Harness-1, a 20B search agent that achieves the strongest average recall among open-source retrieval models while remaining competitive with larger frontier models. The results highlight that the agent's performance relies heavily on a stateful harness design, which enables better generalization on held-out transfer benchmarks compared to training data sources. Ablation experiments further demonstrate that specific harness mechanisms are essential for high-quality evidence curation, as removing them leads to significant performance degradation. Harness-1 outperforms other open-source agents and most frontier models in average curated recall, with only Opus-4.6 performing better on average. The model exhibits superior transfer capabilities, showing larger performance gains on benchmarks excluded from training compared to those included in the training set. Disabling individual harness mechanisms results in consistent drops in recall and final-answer accuracy, validating the importance of the stateful interface.

The table presents an ablation study of the Harness-1 system on the BrowseComp+ benchmark, isolating the impact of individual inference-time mechanisms. The results demonstrate that most components, such as importance tagging and evidence graphs, are critical for maintaining search quality, as disabling them leads to performance degradation. Furthermore, removing all harness mechanisms simultaneously causes a more severe decline in recall than any single ablation, confirming that the stateful interface is fundamental to the agent's ability to curate evidence effectively. Disabling most individual harness mechanisms results in consistent decreases in Final-Answer Recall. The cumulative effect of the harness is significant, with the full system outperforming the version with all mechanisms disabled by a wide margin. Content fingerprint deduplication is the sole exception: removing it yields a slight performance gain, because the dataset contains near-duplicate gold documents.

The authors evaluate Harness-1 against open-source and frontier retrieval models to assess recall and transfer capabilities. Ablation studies on the BrowseComp+ benchmark isolate the impact of individual inference-time mechanisms, revealing that components such as importance tagging and evidence graphs are critical for maintaining search quality. These results confirm that the stateful harness design is fundamental to effective evidence curation, as disabling these mechanisms consistently leads to significant performance degradation in recall and final-answer accuracy.

