5 months ago

Wenjun Li Zhi Chen Jingru Lin Hannan Cao Wei Han Sheng Liang Zhi Zhang Kuicai Dong Dexun Li Chen Zhang

Abstract

Deep research systems, agentic AI that solves complex, multi-step tasks by coordinating reasoning, search across the open web and user files, and tool use, are moving toward hierarchical deployments with a Planner, Coordinator, and Executors. In practice, training entire stacks end-to-end remains impractical, so most work trains a single planner connected to core tools such as search, browsing, and code. While SFT imparts protocol fidelity, it suffers from imitation and exposure biases and underuses environment feedback. Preference alignment methods such as DPO are schema- and proxy-dependent, off-policy, and weak for long-horizon credit assignment and multi-objective trade-offs. A further limitation of SFT and DPO is their reliance on human-defined decision points and subskills through schema design and labeled comparisons. Reinforcement learning aligns with closed-loop, tool-interaction research by optimizing trajectory-level policies, enabling exploration, recovery behaviors, and principled credit assignment, and it reduces dependence on such human priors and rater biases. This survey is, to our knowledge, the first dedicated to the RL foundations of deep research systems. It systematizes work after DeepSeek-R1 along three axes: (i) data synthesis and curation; (ii) RL methods for agentic research covering stability, sample efficiency, long context handling, reward and credit design, multi-objective optimization, and multimodal integration; and (iii) agentic RL training systems and frameworks. We also cover agent architecture and coordination, as well as evaluation and benchmarks, including recent QA, VQA, long-form synthesis, and domain-grounded, tool-interaction tasks. We distill recurring patterns, surface infrastructure bottlenecks, and offer practical guidance for training robust, transparent deep research agents with RL.

Reinforcement Learning

Agent

Supervised Fine-Tuning

Method/Architecture

Reinforcement Learning Foundations for Deep Research Systems: A Survey
