NVIDIA AI-Q Tops DeepResearch Bench and DeepResearch Bench II
NVIDIA AI-Q has achieved the top ranking on both DeepResearch Bench and DeepResearch Bench II, demonstrating a significant advancement in open, portable deep research agents. The system scored 55.95 on the first benchmark and 54.50 on the second, marking the first time a single, configurable stack has led both evaluations. These results show that developer-accessible models and tooling can deliver state-of-the-art performance in agentic research.

AI-Q is an open blueprint that lets enterprises build, inspect, and customize AI agents that reason over enterprise and web data to produce well-cited reports. Unlike closed systems, its fully modular architecture enables organizations to tailor the technology to specific use cases. The winning submission used a multi-agent workflow comprising a planner, a researcher, and an orchestrator, built on the NVIDIA NeMo Agent Toolkit and fine-tuned NVIDIA Nemotron 3 Super models.

The architecture centers on three distinct components. The orchestrator coordinates the entire research loop, directing the planner to create evidence-based strategies and instructing the researcher to gather information. The planner operates in two phases: a scout subagent maps the information landscape, while an architect subagent designs the detailed research plan, including outlines and targeted queries. Finally, the researcher dispatches multiple specialist subagents in parallel, each analyzing the topic through a different analytical lens. This ensures diverse perspectives are captured, often surfacing evidence a single generalist model would miss.

An optional ensemble layer can run multiple independent pipelines simultaneously and merge their outputs to maximize coverage, followed by a post-hoc refiner that polishes the final report for quality and consistency.

Several key ingredients contributed to the system's success.
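The control flow described above can be sketched in a few lines of Python. This is a minimal illustration of the pattern, not the actual AI-Q code: every function here is a hypothetical stand-in for an LLM-backed agent, and the real stack builds these roles on the NeMo Agent Toolkit.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-ins for LLM-backed agents; names and return values
# are illustrative only, not the AI-Q implementation.

def scout(topic: str) -> list[str]:
    """Scout subagent: map the information landscape into subtopics."""
    return [f"{topic}: background", f"{topic}: recent results", f"{topic}: open questions"]

def architect(subtopics: list[str]) -> list[str]:
    """Architect subagent: turn the landscape map into targeted queries."""
    return [f"search: {s}" for s in subtopics]

def specialist(query: str) -> str:
    """Specialist researcher: gather evidence for one query and return
    only a synthesis, never raw tool output."""
    return f"synthesis of evidence for '{query}'"

def orchestrate(topic: str) -> str:
    """Orchestrator: two-phase planning, then parallel specialist research,
    then assembly of the findings into a single report."""
    plan = architect(scout(topic))            # phase 1 + phase 2 planning
    with ThreadPoolExecutor() as pool:        # specialists run in parallel
        findings = list(pool.map(specialist, plan))
    return "\n".join(findings)                # merge into the final report
```

Running `orchestrate("agentic research")` yields one synthesized finding per planned query, which is the raw material the refiner stage would then polish.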
First, the team used a custom fine-tuned NVIDIA Nemotron-3-Super-120B-A12B model, trained on approximately 67,000 supervised fine-tuning trajectories derived from real-world search and synthesis data. This training aligns the model with multi-step reasoning and tool use.

Second, the system incorporates custom middleware designed for long-horizon reliability. Because agents often perform 32 or more steps involving tool calls, the middleware mitigates failure patterns that shorter interactions never expose, keeping the system robust over extended runs.

Third, the design emphasizes flexibility: every component, including the underlying large language model for each agent, can be swapped or configured via YAML files.

The core workflow follows a consistent cycle of planning, gathering, and synthesizing information, supported by web search tools like Tavily and academic search tools like Serper. The result is citation-backed reports that ground claims in retrieved evidence. The separation of concerns between the orchestrator and the specialized researchers also serves as a long-context strategy: because subagents return only synthesized outputs rather than raw tool data, the orchestrator maintains a focused context window, preventing noise from degrading its high-level reasoning.

NVIDIA's achievement underscores the viability of open, reproducible systems in deep research. By combining the NeMo Agent Toolkit with fine-tuned Nemotron models and reliability-focused middleware, the AI-Q stack delivers superior results without sacrificing transparency or control. The company will share further details at NVIDIA GTC in San Jose in March 2026, inviting developers to explore how this flexible architecture can power their own research applications.
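The YAML-driven flexibility might look something like the fragment below. The key names and structure here are purely illustrative assumptions, not the actual AI-Q or NeMo Agent Toolkit schema; the point is that each agent's model and each tool binding lives in configuration rather than code.

```yaml
# Hypothetical configuration sketch -- key names are illustrative,
# not the real AI-Q schema.
agents:
  planner:
    llm: nemotron-3-super-custom-sft   # swappable per agent
  researcher:
    llm: nemotron-3-super-custom-sft
    parallel_subagents: 4
tools:
  web_search: tavily
  academic_search: serper
```

Swapping a model or search backend then means editing one line of configuration instead of touching the workflow code.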
