MCP-Bench: Benchmarking Tool-Using LLM Agents with Complex Real-World Tasks via MCP Servers

Abstract

We introduce MCP-Bench, a benchmark for evaluating large language models (LLMs) on realistic, multi-step tasks that demand tool use, cross-tool coordination, precise parameter control, and planning/reasoning for solving tasks. Built on the Model Context Protocol (MCP), MCP-Bench connects LLMs to 28 representative live MCP servers spanning 250 tools across domains such as finance, travel, scientific computing, and academic search. Unlike prior API-based benchmarks, each MCP server provides a set of complementary tools designed to work together, enabling the construction of authentic, multi-step tasks with rich input-output coupling. Tasks in MCP-Bench test agents' ability to retrieve relevant tools from fuzzy instructions without explicit tool names, plan multi-hop execution trajectories for complex objectives, ground responses in intermediate tool outputs, and orchestrate cross-domain workflows; these capabilities are not adequately evaluated by existing benchmarks, which rely on explicit tool specifications, shallow few-step workflows, and isolated domain operations. We propose a multi-faceted evaluation framework covering tool-level schema understanding and usage, trajectory-level planning, and task completion. Experiments on 20 advanced LLMs reveal persistent challenges in MCP-Bench. Code and data: https://github.com/Accenture/mcp-bench.
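To make the protocol-level setup concrete, below is a minimal sketch, using the official mcp Python SDK, of how a client connects to a single MCP server, discovers its tools, and invokes one. The server launch command, tool name, and arguments are hypothetical placeholders for illustration, not artifacts of the benchmark itself; in MCP-Bench the agent faces 28 such servers and must select tools from fuzzy task instructions rather than from explicit tool names.

    import asyncio

    from mcp import ClientSession, StdioServerParameters
    from mcp.client.stdio import stdio_client

    # Hypothetical launch command for one live MCP server exposing
    # finance tools; each server advertises its own tool set.
    server_params = StdioServerParameters(command="python", args=["finance_server.py"])

    async def main() -> None:
        async with stdio_client(server_params) as (read, write):
            async with ClientSession(read, write) as session:
                await session.initialize()

                # Tool discovery: the client sees only each tool's name,
                # description, and JSON input schema, so matching a vague
                # instruction to the right tool is the agent's job.
                tools = await session.list_tools()
                for tool in tools.tools:
                    print(tool.name, "-", tool.description)

                # Hypothetical invocation of one discovered tool; the
                # arguments must conform to the tool's declared schema.
                result = await session.call_tool(
                    "get_stock_quote", arguments={"ticker": "ACN"}
                )
                print(result.content)

    asyncio.run(main())

Multi-hop tasks in the benchmark chain many such calls, feeding intermediate tool outputs into the arguments of later calls, often across servers from different domains.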
