
WideSearch: Benchmarking Agentic Broad Info-Seeking

Ryan Wong, Jiawei Wang, Junjie Zhao, Li Chen, Yan Gao, Long Zhang, Xuan Zhou, Zuo Wang, Kai Xiang, Ge Zhang, Wenhao Huang, Yang Wang, Ke Wang
Abstract

From professional research to everyday planning, many tasks are bottlenecked by wide-scale information seeking, which is more repetitive than cognitively complex. With the rapid development of Large Language Models (LLMs), automated search agents powered by LLMs offer a promising solution to liberate humans from this tedious work. However, the capability of these agents to perform such "wide-context" collection reliably and completely remains largely unevaluated due to a lack of suitable benchmarks. To bridge this gap, we introduce WideSearch, a new benchmark engineered to evaluate agent reliability on these large-scale collection tasks. The benchmark features 200 manually curated questions (100 in English, 100 in Chinese) from over 15 diverse domains, grounded in real user queries. Each task requires agents to collect large-scale atomic information, which can be verified objectively item by item, and arrange it into a well-organized output. A rigorous five-stage quality control pipeline ensures the difficulty, completeness, and verifiability of the dataset. We benchmark over 10 state-of-the-art agentic search systems, including single-agent and multi-agent frameworks as well as end-to-end commercial systems. Most systems achieve overall success rates near 0%, with the best performer reaching just 5%. However, given sufficient time, cross-validation by multiple human testers can achieve a near-100% success rate. These results demonstrate that present search agents have critical deficiencies in large-scale information seeking, underscoring urgent areas for future research and development in agentic search. Our dataset, evaluation pipeline, and benchmark results have been publicly released at https://widesearch-seed.github.io/
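To make the item-by-item verification described above concrete, the sketch below shows one plausible way to score a WideSearch-style task. The tabular answer format, the `normalize` and `score_task` helpers, and the strict all-items-correct success rule are illustrative assumptions, not the paper's actual evaluation pipeline.

```python
# Hypothetical sketch of strict per-item scoring for a WideSearch-style task.
# Assumptions (not taken from the paper): answers are tables keyed by entity,
# each cell is an atomic fact checked by exact match after normalization, and
# a task counts as a "success" only when every required cell is correct.

def normalize(value: str) -> str:
    """Light normalization so trivial formatting differences are not penalized."""
    return " ".join(value.strip().lower().split())


def score_task(gold: dict[str, dict[str, str]],
               predicted: dict[str, dict[str, str]]) -> dict[str, float]:
    """Compare a predicted table against the gold table, item by item."""
    total, correct = 0, 0
    for entity, gold_fields in gold.items():
        pred_fields = predicted.get(entity, {})
        for field, gold_value in gold_fields.items():
            total += 1
            if normalize(pred_fields.get(field, "")) == normalize(gold_value):
                correct += 1
    item_accuracy = correct / total if total else 0.0
    return {
        "item_accuracy": item_accuracy,
        "task_success": 1.0 if total > 0 and correct == total else 0.0,
    }


if __name__ == "__main__":
    gold = {"Alpha Corp": {"founded": "1998", "hq": "Berlin"}}
    pred = {"Alpha Corp": {"founded": "1998", "hq": "Munich"}}
    print(score_task(gold, pred))  # item_accuracy 0.5, task_success 0.0
```

A strict all-items-correct criterion of this kind would be consistent with overall success rates near 0% even when agents retrieve many individual facts correctly.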