4 months ago

Zijian Wu, Xiangyan Liu, Xinyuan Zhang, Lingjun Chen, Fanqing Meng, Lingxiao Du, Yiran Zhao, Fanshi Zhang, Yaoqi Ye, Jiawei Wang, and 5 more

Abstract

MCP standardizes how LLMs interact with external systems, forming the foundation for general agents. However, existing MCP benchmarks remain narrow in scope: they focus on read-heavy tasks or tasks with limited interaction depth, and fail to capture the complexity and realism of real-world workflows. To address this gap, we propose MCPMark, a benchmark designed to evaluate MCP use in a more realistic and comprehensive manner. It consists of $127$ high-quality tasks collaboratively created by domain experts and AI agents. Each task begins with a curated initial state and includes a programmatic script for automatic verification. These tasks demand richer and more diverse interactions with the environment, involving a broad range of create, read, update, and delete (CRUD) operations. We conduct a comprehensive evaluation of cutting-edge LLMs using a minimal agent framework that operates in a tool-calling loop. Empirical results show that the best-performing model, gpt-5-medium, reaches only $52.56\%$ pass@1 and $33.86\%$ pass^4, while other widely regarded strong models, including claude-sonnet-4 and o3, fall below $30\%$ pass@1 and $15\%$ pass^4. On average, LLMs require $16.2$ execution turns and $17.4$ tool calls per task, significantly surpassing those in previous MCP benchmarks and highlighting the stress-testing nature of MCPMark.
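The abstract reports pass@1 and pass^4 without defining them. The sketch below shows one common reading, stated here as an assumption rather than the authors' definition: pass@1 is the per-run success rate averaged over tasks, and pass^k is the fraction of tasks for which all of the first k independent runs succeed. The function names and the toy `results` data are hypothetical and are not taken from the paper.

```python
from typing import Dict, List


def pass_at_1(results: Dict[str, List[bool]]) -> float:
    """Average per-run success rate across tasks (assumed reading of pass@1)."""
    per_task = [sum(runs) / len(runs) for runs in results.values()]
    return sum(per_task) / len(per_task)


def pass_hat_k(results: Dict[str, List[bool]], k: int) -> float:
    """Fraction of tasks whose first k runs ALL succeed (assumed reading of pass^k)."""
    for task, runs in results.items():
        if len(runs) < k:
            raise ValueError(f"task {task!r} has fewer than {k} runs")
    solved = [all(runs[:k]) for runs in results.values()]
    return sum(solved) / len(solved)


# Toy data: four hypothetical tasks, four runs each (True = verifier passed).
results = {
    "task_a": [True, True, True, True],
    "task_b": [True, False, True, True],
    "task_c": [False, False, False, False],
    "task_d": [True, True, False, False],
}

print(pass_at_1(results))      # 0.5625
print(pass_hat_k(results, 4))  # 0.25
```

Under this reading, pass^4 can never exceed pass@1, which is consistent with the gap between the two figures reported for each model.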

MCPMark: A Benchmark for Stress-Testing Realistic and Comprehensive MCP Use
