MCP-Bench: Benchmarking Tool-Using LLM Agents with Complex Real-World Tasks via MCP Servers

Abstract
We introduce MCP-Bench, a benchmark for evaluating large language models (LLMs) on realistic, multi-step tasks that demand tool use, cross-tool coordination, precise parameter control, and planning/reasoning for solving tasks. Built on the Model Context Protocol (MCP), MCP-Bench connects LLMs to 28 representative live MCP servers spanning 250 tools across domains such as finance, traveling, scientific computing, and academic search. Unlike prior API-based benchmarks, each MCP server provides a set of complementary tools designed to work together, enabling the construction of authentic, multi-step tasks with rich input-output coupling. Tasks in MCP-Bench test agents' ability to retrieve relevant tools from fuzzy instructions without explicit tool names, plan multi-hop execution trajectories for complex objectives, ground responses in intermediate tool outputs, and orchestrate cross-domain workflows, capabilities not adequately evaluated by existing benchmarks that rely on explicit tool specifications, shallow few-step workflows, and isolated domain operations. We propose a multi-faceted evaluation framework covering tool-level schema understanding and usage, trajectory-level planning, and task completion. Experiments on 20 advanced LLMs reveal persistent challenges in MCP-Bench. Code and data: https://github.com/Accenture/mcp-bench.
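
For readers unfamiliar with MCP, the sketch below illustrates, at a high level, the agent-side interaction pattern the benchmark exercises at scale: connecting to a live MCP server, enumerating its tool schemas, and invoking a tool. It is a minimal example using the official Python MCP SDK; the server command ("server.py") and tool name ("search_flights") are hypothetical placeholders, not components of MCP-Bench itself.

```python
# Minimal sketch of the agent-side MCP interaction pattern described above.
# Assumes the official Python MCP SDK (`pip install mcp`); the server command
# and tool name below are hypothetical placeholders, not MCP-Bench components.
import asyncio

from mcp import ClientSession, StdioServerParameters
from mcp.client.stdio import stdio_client


async def main() -> None:
    # Launch a (hypothetical) MCP server over stdio and open a client session.
    server = StdioServerParameters(command="python", args=["server.py"])
    async with stdio_client(server) as (read, write):
        async with ClientSession(read, write) as session:
            await session.initialize()

            # The agent first retrieves tool schemas; in MCP-Bench the LLM must
            # select among such tools from fuzzy instructions, without being
            # given explicit tool names.
            tools = await session.list_tools()
            for tool in tools.tools:
                print(tool.name, "-", tool.description)

            # A single tool invocation; multi-hop tasks chain many such calls,
            # feeding intermediate outputs into later parameters.
            result = await session.call_tool(
                "search_flights",  # hypothetical tool name
                arguments={"origin": "SFO", "destination": "NRT"},
            )
            print(result.content)


if __name__ == "__main__":
    asyncio.run(main())
```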