Reinforcement Learning Foundations for Deep Research Systems: A Survey

Abstract

Deep research systems, agentic AI that solve complex, multi-step tasks by coordinating reasoning, search across the open web and user files, and tool use, are moving toward hierarchical deployments with a Planner, Coordinator, and Executors. In practice, training entire stacks end-to-end remains impractical, so most work trains a single planner connected to core tools such as search, browsing, and code. While SFT imparts protocol fidelity, it suffers from imitation and exposure biases and underuses environment feedback. Preference alignment methods such as DPO are schema and proxy-dependent, off-policy, and weak for long-horizon credit assignment and multi-objective trade-offs. A further limitation of SFT and DPO is their reliance on human-defined decision points and subskills through schema design and labeled comparisons. Reinforcement learning aligns with closed-loop, tool-interaction research by optimizing trajectory-level policies, enabling exploration, recovery behaviors, and principled credit assignment, and it reduces dependence on such human priors and rater biases. This survey is, to our knowledge, the first dedicated to the RL foundations of deep research systems. It systematizes work after DeepSeek-R1 along three axes: (i) data synthesis and curation; (ii) RL methods for agentic research covering stability, sample efficiency, long context handling, reward and credit design, multi-objective optimization, and multimodal integration; and (iii) agentic RL training systems and frameworks. We also cover agent architecture and coordination, as well as evaluation and benchmarks, including recent QA, VQA, long-form synthesis, and domain-grounded, tool-interaction tasks. We distill recurring patterns, surface infrastructure bottlenecks, and offer practical guidance for training robust, transparent deep research agents with RL.
