
DR Tulu: Reinforcement Learning with Evolving Rubrics for Deep Research

Abstract

Deep research models carry out multi-step search to produce long-form, properly cited answers. However, most open deep research models are trained on short-form, easily verifiable question-answering tasks via reinforcement learning with verifiable rewards (RLVR), an approach that does not generalize to realistic long-form tasks.

We address this problem with Reinforcement Learning with Evolving Rubrics (RLER), in which we construct and maintain rubrics that co-evolve with the policy model during training. This allows the rubrics to incorporate information the model has newly explored and to provide discriminative, on-policy feedback.

Using RLER, we develop Deep Research Tulu (DR Tulu-8B), the first open model directly trained for open-ended, long-form deep research. Across four deep research benchmarks spanning science, healthcare, and general domains, DR Tulu substantially outperforms existing open models and matches or exceeds proprietary deep research systems, while being considerably smaller and cheaper per query.

To facilitate future work, we release all data, models, and code, including our new MCP-based agent infrastructure for deep research systems.

Summarization

Researchers from the University of Washington, Allen Institute for AI, MIT, et al. introduce Deep Research Tulu (DR Tulu-8B), an open model that leverages Reinforcement Learning with Evolving Rubrics (RLER) to generate dynamic, on-policy feedback for long-form tasks, thereby matching the performance of proprietary systems in complex domains like science and healthcare.

Introduction

Deep research models are designed to execute complex tasks by planning, searching, and synthesizing information from diverse sources, yet training open-source versions remains difficult due to the challenge of verifying long-form, open-ended responses. While proprietary systems have advanced significantly, reliable evaluation requires access to dynamic world knowledge rather than static metrics, creating a bottleneck for current training methodologies.

Limitations of Prior Work

Existing open models typically rely on training-free prompting strategies or are trained indirectly using short-form question answering as a proxy for deep research. These approaches often depend on fixed, hard-coded search tools and static evaluation rubrics, which fail to capture the nuance of complex reasoning or provide the verifiable citations necessary for rigorous academic and professional tasks.

Main Contribution

The authors introduce Deep Research Tulu (DR Tulu-8B), the first open model directly trained for long-form deep research tasks. Their primary methodological contribution is Reinforcement Learning with Evolving Rubrics (RLER), a training framework where evaluation criteria co-evolve with the policy model. This allows the system to generate dynamic, on-policy feedback that contrasts specific model responses against external evidence found during the search process.

Key Innovations and Advantages

  • Dynamic Rubric Evolution: RLER constructs new rubrics at each training step by analyzing the model's search traces, preventing reward hacking and ensuring feedback adapts to the model's expanding knowledge base.
  • High Efficiency and Performance: Despite being an 8B parameter model, DR Tulu outperforms larger open models (up to 30B) and matches proprietary systems like OpenAI Deep Research while reducing inference costs by nearly three orders of magnitude.
  • Adaptive Tool Usage: The model learns to autonomously select the most appropriate search tools (e.g., switching between general web search and academic paper repositories) and generates comprehensive reports with accurate, evidence-linked citations.

Dataset

The authors construct a specialized dataset of approximately 16,000 trajectories to address the "cold start" problem, teaching the model to plan, invoke tools, and format citations before Reinforcement Learning (RL). The dataset construction involves the following components and processes:

  • Data Composition: The dataset combines long-form, open-ended information-seeking queries with short-form, verifiable QA tasks. This mixture ensures the model learns to adapt its style without overfitting to a single task type.
  • Long-form Sources: Prompts are derived from SearchArena (real-world user-assistant conversations) and OpenScholar (scientific research queries). The authors apply an LLM-based filter where GPT-5 rates prompts on a 1 to 5 scale for complexity, retaining approximately 10% of SearchArena and 20% of OpenScholar queries.
  • Short-form Sources: To encourage precision, the authors sample questions from HotpotQA, TaskCraft, WebWalker-Silver, MegaScience, PopQA, and TyDi QA. They also generate synthetic questions in the style of BrowseComp using GPT-4.1.
  • Trajectory Generation: The authors use GPT-5 as a teacher model to generate end-to-end trajectories. These include explicit "thinking" tokens, tool invocations (Google Search, Web Browse, Paper Search), and final answers with citations.
  • Data Transformation: For Ai2 ScholarQA, the authors transform static retrieval data into iterative search trajectories. They use GPT-4.1 to generate sub-queries based on section text and cited papers, interleaving these with reasoning steps to simulate a multi-step research process.
  • Rejection Sampling: The team applies strict filtering to the generated data. They discard trajectories that fail to follow tool-calling formats or omit answer tags. For short-form tasks, they retain only those examples where the generated answer matches the gold standard, verified via F1 overlap (>0.9) or an LLM judge (see the sketch after this list).
  • RL Data Separation: Distinct from the SFT set, the authors curate a separate set of high-quality prompts from SearchArena and OpenScholar for the RL stage, generating rubrics using GPT-4.1-mini based on retrieved search results.
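The F1-based rejection-sampling criterion above can be made concrete with a short sketch. This is a minimal illustration, not the authors' released filtering code: the trajectory fields (`follows_tool_format`, `answer`, `gold_answer`) are hypothetical names, tokenization is plain whitespace splitting, and the alternative LLM-judge path mentioned in the list is omitted.

```python
from collections import Counter

def token_f1(prediction: str, gold: str) -> float:
    """Token-level F1 overlap between a generated answer and the gold answer."""
    pred_tokens = prediction.lower().split()
    gold_tokens = gold.lower().split()
    if not pred_tokens or not gold_tokens:
        return 0.0
    common = Counter(pred_tokens) & Counter(gold_tokens)
    num_same = sum(common.values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_tokens)
    recall = num_same / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)

def keep_short_form_trajectory(traj: dict, f1_threshold: float = 0.9) -> bool:
    """Rejection-sampling filter for short-form SFT data: discard trajectories
    with malformed tool calls or missing answer tags, and keep the rest only if
    the generated answer closely matches the gold answer (F1 > 0.9)."""
    if not traj.get("follows_tool_format") or "answer" not in traj:
        return False
    return token_f1(traj["answer"], traj["gold_answer"]) > f1_threshold
```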

Method

The authors leverage a reinforcement learning framework with evolving rubrics (RLER) to train deep research models that can perform long-form reasoning and information synthesis using external tools. The core of this approach is a policy model, denoted as $\pi_\theta$, which operates in an autoregressive manner over a sequence of actions and content. The model's action space includes four distinct types: think, which generates internal reasoning; tool, which invokes external search tools such as web browsing or paper search; answer, which produces the final response; and cite, which wraps claims in citation tags. The policy interacts with an environment that provides access to various search tools, enabling it to retrieve and incorporate external knowledge into its responses. The training process is designed to optimize the model's ability to generate high-quality, well-supported answers by using a dynamic set of evaluation criteria, or rubrics, that adapt over time.
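As a rough illustration of this action space, a decoded trajectory can be thought of as a sequence of typed steps. The tag names, tool identifiers, and JSON-like structure below are assumptions made for readability; the released DR Tulu code and MCP agent infrastructure define the actual schema.

```python
# Illustrative only: one possible representation of a rollout under the
# think / tool / answer / cite action space described above.
example_trajectory = [
    {"type": "think", "content": "The query concerns recent gene-therapy trials; "
                                 "an academic paper search is the best first step."},
    {"type": "tool", "name": "paper_search",
     "args": {"query": "gene therapy clinical trial outcomes 2024"}},
    {"type": "tool", "name": "web_browse",
     "args": {"url": "https://example.org/trial-report"}},  # placeholder URL
    {"type": "answer", "content": "Recent trials report durable responses ..."},
    # In the actual output format, cite tags wrap individual claims inside the
    # answer; the citation is shown here as a separate step for clarity.
    {"type": "cite", "claim": "durable responses ...",
     "sources": ["paper_search:2", "web_browse:1"]},
]
```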

The framework begins with a set of persistent rubrics, which are generated for each training instance by retrieving relevant documents from the internet and using a language model to produce a set of initial evaluation criteria. These rubrics are grounded in real-world knowledge and are intended to be stable throughout the training process. During each training step, the policy model generates multiple rollouts, or response trajectories, for a given prompt. These rollouts are then used to generate a new set of evolving rubrics. The generation process involves contrasting the different responses to identify both positive aspects—such as new, relevant knowledge explored by the model—and negative aspects, such as undesirable behaviors like reward hacking or verbatim copying. This process ensures that the rubrics are tailored to the current policy's behaviors and are grounded in the knowledge it has explored.
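A minimal sketch of this contrastive rubric-generation step is shown below, assuming a generic `llm_complete(prompt) -> str` helper around whatever model serves as the rubric generator; the prompt wording and the JSON response format are illustrative, not the authors' exact setup.

```python
import json

RUBRIC_PROMPT = """You are given a research question and several candidate responses
produced by the same policy model. By contrasting the responses, write new evaluation
rubrics:
- positive rubrics rewarding new, relevant, well-supported information that only some
  responses contain;
- negative rubrics penalizing undesirable behaviors such as verbatim copying of
  retrieved text or unsupported claims.
Return a JSON list of objects, each with a "criterion" string and a "weight" number.

Question: {question}

Responses:
{responses}
"""

def generate_evolving_rubrics(question: str, rollouts: list[str], llm_complete) -> list[dict]:
    """Contrast a group of on-policy rollouts to produce instance-specific rubrics."""
    responses = "\n\n".join(f"[Response {i + 1}]\n{r}" for i, r in enumerate(rollouts))
    raw = llm_complete(RUBRIC_PROMPT.format(question=question, responses=responses))
    return json.loads(raw)  # new rubrics to be added to the rubric buffer
```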

The evolving rubrics are added to a rubric buffer, which is managed to maintain a compact and informative set. After each training step, the model scores all generated responses using the current set of rubrics, and the standard deviation of the rewards for each rubric is computed. Rubrics with zero variance are removed, as they offer no discriminative power, and the remaining rubrics are ranked by their standard deviation. Only the top $K_{\max}$ rubrics with the highest variance are retained, ensuring that the most informative and discriminative criteria are used for evaluation. This dynamic management of the rubric buffer allows the evaluation criteria to co-evolve with the policy, providing on-policy feedback that the model can effectively learn from.
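The buffer-pruning rule, which keeps only rubrics whose rewards actually vary across the current rollout group, can be sketched as follows. The per-rubric scores are assumed to come from an LLM judge applied to each response, and the value of K_max here is illustrative, not the authors' setting.

```python
import statistics

def prune_rubric_buffer(rubrics: list[str],
                        scores_per_rubric: dict[str, list[float]],
                        k_max: int = 8) -> list[str]:
    """Keep the k_max most discriminative rubrics in the buffer.

    scores_per_rubric maps each rubric to the list of scores it assigned to the
    rollouts generated at the current training step (one score per rollout).
    Rubrics with zero variance are dropped; the rest are ranked by standard
    deviation and only the top k_max are retained.
    """
    ranked = []
    for rubric in rubrics:
        spread = statistics.pstdev(scores_per_rubric[rubric])
        if spread > 0:  # zero variance = no discriminative power
            ranked.append((spread, rubric))
    ranked.sort(key=lambda pair: pair[0], reverse=True)
    return [rubric for _, rubric in ranked[:k_max]]
```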

Experiment

  • Evaluated DR Tulu-8B on four long-form benchmarks (HealthBench, ResearchQA, SQAv2, DeepResearchBench), where it outperformed existing open deep research models by 13.7 to 53.4 points on average.
  • Compared performance against proprietary systems, showing that DR Tulu-8B matches or surpasses OpenAI Deep Research and Perplexity on SQAv2 while reducing query costs to approximately USD 0.0018.
  • Assessed the impact of Reinforcement Learning with Evolving Rubrics (RLER), finding that it improved rubric coverage by 8.2 points and boosted citation precision and recall on SQAv2 by 23.3 and 22.1 points, respectively.
  • Validated performance on a novel GeneticDiseasesQA dataset, where the model surpassed Ai2 ScholarQA and Gemini 3 + Search in overall scores and matched proprietary models in evidence quality and synthesis.
  • Tested generalization on short-form QA tasks (SimpleQA, 2Wiki), revealing that RL training improved average accuracy by 4.4 points despite the focus on long-form optimization.
  • Conducted ablation studies on training methodologies, demonstrating that search-based and evolving rubrics yield over 50% more assertive claims than naively generated rubrics and that combining long-form and short-form SFT data prevents performance degradation.

The authors use search-based rubrics to generate more specific and factual evaluation criteria compared to general or closed-book rubrics. Results show that initial and evolving rubrics achieve higher fractions of assertive claims and factuality, with evolving rubrics reaching perfect factuality scores, indicating they are more effective for training deep research agents.

The authors use a combination of search-based and evolving rubrics to improve the quality of deep research agents, with results showing that DR Tulu-8B (RL) outperforms all open deep research models and matches or exceeds proprietary systems on long-form benchmarks. The model achieves the highest average score of 63.7 across four evaluation datasets, demonstrating superior performance in both content quality and citation accuracy compared to baselines.

The authors use the table to compare the performance of various deep research models on two long-form benchmarks, AstaBench-ScholarQA-CS2 (SQAv2) and DeepResearchBench (DRB). Results show that DR Tulu-8B (RL) achieves the highest scores on both benchmarks, outperforming all other open and closed deep research models, including larger proprietary systems like GPT-5 and Gemini 3 Pro. The model also demonstrates strong performance on citation metrics, with the RL version achieving 88.6 on Cite-P and 73.7 on Cite-R on SQAv2, significantly surpassing most baselines.

The authors use the table to compare the performance and cost of various deep research systems across metrics such as answer length, citations, tool calls, and cost per query. Results show that DR Tulu-8B (RL) achieves a balanced performance with moderate answer length, 35.8 citations, and 4.3 tool calls, while maintaining a significantly lower cost of $0.0019 per query compared to proprietary systems like OpenAI Deep Research and GPT-5 + Search, which cost $1.80 and $0.29 per query, respectively.

The authors use a short-form QA evaluation to assess the generalization of their deep research models beyond long-form tasks. Results show that DR Tulu-8B (RL) achieves the highest average score of 62.4 across SimpleQA, 2Wiki, and WebWalker, outperforming both the base Qwen3-8B and QwQ-32B models as well as the SFT version of DR Tulu. This indicates that the RL training, despite being applied only to long-form tasks, improves performance on short-form QA benchmarks.
