WebShaper: Agentically Data Synthesizing via Information-Seeking Formalization

The advent of Large Language Model (LLM)-powered agents has revolutionized artificial intelligence by enabling solutions to complex, open-ended tasks through web-based information-seeking (IS) capabilities. The scarcity of high-quality training data has limited the development of IS agents. Existing approaches typically adopt an information-driven paradigm that first collects web data and then generates questions based on the retrieved content. However, this may lead to inconsistency between information structure and reasoning structure, and between question and answer. To mitigate this, we propose WebShaper, a formalization-driven IS data synthesis framework, to construct a dataset. WebShaper systematically formalizes IS tasks through set theory. Central to the formalization is the concept of Knowledge Projections (KP), which enables precise control over reasoning structure through compositions of KP operations. During synthesis, we begin by creating seed tasks and then apply a multi-step expansion process. At each step, an agentic Expander makes the current formal question more complex, using retrieval and validation tools based on our formalization. We train our model on the synthesized dataset. Experimental results demonstrate that WebShaper achieves state-of-the-art performance among open-source IS agents on the GAIA and WebWalkerQA benchmarks.
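
To make the set-theoretic idea concrete, the following is a minimal illustrative sketch, not the formalization used by WebShaper itself. It assumes that a projection maps a relation and a set of target entities to the set of entities related to at least one target, and that questions are built by composing such projections with intersection and union. The function `project`, the relation names, and all toy data are hypothetical.

```python
from typing import Set, Tuple

# A relation is modeled as a set of (subject, object) pairs (an assumption).
Relation = Set[Tuple[str, str]]


def project(relation: Relation, targets: Set[str]) -> Set[str]:
    """Hypothetical projection: subjects related to at least one target."""
    return {u for (u, v) in relation if v in targets}


# Toy knowledge, invented purely for illustration.
played_for: Relation = {("alice", "club_a"), ("bob", "club_a"), ("carol", "club_b")}
born_in: Relation = {("alice", "1990"), ("bob", "1985"), ("carol", "1990")}
founded_in: Relation = {("club_a", "1900"), ("club_b", "1950")}

# Seed question: "Who played for club_a and was born in 1990?"
# Intersection combines two conditions on the same unknown entity.
seed = project(played_for, {"club_a"}) & project(born_in, {"1990"})

# One expansion step: the constant {"club_a"} is replaced by a sub-question
# ("the club founded in 1900"), deepening the reasoning chain while the
# answer set stays the same.
expanded = project(played_for, project(founded_in, {"1900"})) & project(born_in, {"1990"})

# Union expresses uncertainty over a condition: "born in 1985 or 1990".
either_year = project(born_in, {"1985"}) | project(born_in, {"1990"})

print(seed, expanded, either_year)
```

In this toy setting, the expansion step changes the structure of the question but not its answer, which is the property a formalization-driven synthesis process would need to preserve when making questions harder.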