
LiveMCP-101: Stress Testing and Diagnosing MCP-enabled Agents on Challenging Queries

Ming Yin, Dinghan Shen, Silei Xu, Jianbing Han, Sixun Dong, Mian Zhang, Yebowen Hu, Shujian Liu, Simin Ma, Song Wang, Sathish Reddy Indurthi, Xun Wang, Yiran Chen, Kaiqiang Song
Abstract

Tool calling has emerged as a critical capability for AI agents to interact with the real world and solve complex tasks. While the Model Context Protocol (MCP) provides a powerful standardized framework for tool integration, there is a significant gap in benchmarking how well AI agents can effectively solve multi-step tasks using diverse MCP tools in realistic, dynamic scenarios. In this work, we present LiveMCP-101, a benchmark of 101 carefully curated real-world queries, refined through iterative LLM rewriting and manual review, that require coordinated use of multiple MCP tools including web search, file operations, mathematical reasoning, and data analysis. Moreover, we introduce a novel evaluation approach that leverages ground-truth execution plans rather than raw API outputs, better reflecting the evolving nature of real-world environments. Experiments show that even frontier LLMs achieve a success rate below 60%, highlighting major challenges in tool orchestration. Detailed ablations and error analysis further reveal distinct failure modes and inefficiencies in token usage, pointing to concrete directions for advancing current models. LiveMCP-101 sets a rigorous standard for evaluating real-world agent capabilities, advancing toward autonomous AI systems that reliably execute complex tasks through tool use.
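
To make the plan-based evaluation idea concrete, the sketch below scores an agent's tool-call trace against a reference execution plan rather than against raw API outputs. The data structures, step names, and in-order matching rule are illustrative assumptions for this sketch, not the paper's actual scoring procedure.

```python
# Hypothetical sketch: step names, data structures, and the matching rule
# are assumptions for illustration, not the benchmark's implementation.
from dataclasses import dataclass


@dataclass(frozen=True)
class PlanStep:
    tool: str     # MCP tool name, e.g. "web_search"
    purpose: str  # what the step is meant to accomplish


def plan_match_rate(ground_truth: list[PlanStep], agent_trace: list[PlanStep]) -> float:
    """Fraction of ground-truth steps the agent's trace covers, in order.

    Matching on the intended tool of each step (rather than on raw API
    outputs) keeps the score stable even when live results change between
    runs, which is the motivation for plan-based evaluation.
    """
    matched, i = 0, 0
    for gt_step in ground_truth:
        # Scan forward through the agent trace for a step using the same tool.
        while i < len(agent_trace) and agent_trace[i].tool != gt_step.tool:
            i += 1
        if i < len(agent_trace):
            matched += 1
            i += 1
    return matched / len(ground_truth) if ground_truth else 1.0


if __name__ == "__main__":
    reference = [
        PlanStep("web_search", "find the current exchange rate"),
        PlanStep("calculator", "convert the budget to USD"),
        PlanStep("file_write", "save the final report"),
    ]
    trace = [
        PlanStep("web_search", "look up exchange rate"),
        PlanStep("file_write", "save the report"),
    ]
    print(f"plan match rate: {plan_match_rate(reference, trace):.2f}")  # 0.67
```

In this toy run the agent skips the calculation step, so it covers two of the three reference steps; a stricter variant could also require the step purposes or arguments to align.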