8 months ago

Abstract

Deep Research Agents are a prominent category of LLM-based agents. Byautonomously orchestrating multistep web exploration, targeted retrieval, andhigher-order synthesis, they transform vast amounts of online information intoanalyst-grade, citation-rich reports--compressing hours of manual desk researchinto minutes. However, a comprehensive benchmark for systematically evaluatingthe capabilities of these agents remains absent. To bridge this gap, we presentDeepResearch Bench, a benchmark consisting of 100 PhD-level research tasks,each meticulously crafted by domain experts across 22 distinct fields.Evaluating DRAs is inherently complex and labor-intensive. We therefore proposetwo novel methodologies that achieve strong alignment with human judgment. Thefirst is a reference-based method with adaptive criteria to assess the qualityof generated research reports. The other framework is introduced to evaluateDRA's information retrieval and collection capabilities by assessing itseffective citation count and overall citation accuracy. We have open-sourcedDeepResearch Bench and key components of these frameworks athttps://github.com/Ayanami0730/deep_research_bench to accelerate thedevelopment of practical LLM-based agents.

Source PDF View Code