
Abstract

The Model Context Protocol (MCP) is rapidly emerging as a pivotal open standard, designed to enhance agent-tool integration and interoperability, and is positioned to unlock a new era of powerful, interconnected, and genuinely utilitarian agentic AI. However, despite MCP's growing adoption, existing benchmarks often fail to capture real-world agent performance within this new paradigm, leading to a distorted perception of their true operational value and an inability to reliably differentiate proficiencies. To bridge this critical evaluation gap, we introduce MCP-AgentBench -- a comprehensive benchmark specifically engineered to rigorously assess language agent capabilities in MCP-mediated tool interactions. Core contributions of MCP-AgentBench include: the establishment of a robust MCP testbed comprising 33 operational servers with 188 distinct tools; the development of a benchmark featuring 600 systematically designed queries distributed across 6 distinct categories of varying interaction complexity; and the introduction of MCP-Eval, a novel outcome-oriented evaluation methodology prioritizing real-world task success. Through extensive empirical evaluation of leading language agents, we provide foundational insights. MCP-AgentBench aims to equip the research community with a standardized and reliable framework to build, validate, and advance agents capable of fully leveraging MCP's transformative benefits, thereby accelerating progress toward truly capable and interoperable AI systems.
