
LiveMCP-101: Stress Testing and Diagnosing MCP-enabled Agents on Challenging Queries

Ming Yin, Dinghan Shen, Silei Xu, Jianbing Han, Sixun Dong, Mian Zhang, Yebowen Hu, Shujian Liu, Simin Ma, Song Wang, Sathish Reddy Indurthi, Xun Wang, Yiran Chen, Kaiqiang Song
Abstract

Tool calling has emerged as a critical capability for AI agents to interact with the real world and solve complex tasks. While the Model Context Protocol (MCP) provides a powerful standardized framework for tool integration, there is a significant gap in benchmarking how well AI agents can effectively solve multi-step tasks using diverse MCP tools in realistic, dynamic scenarios. In this work, we present LiveMCP-101, a benchmark of 101 carefully curated real-world queries, refined through iterative LLM rewriting and manual review, that require coordinated use of multiple MCP tools including web search, file operations, mathematical reasoning, and data analysis. Moreover, we introduce a novel evaluation approach that leverages ground-truth execution plans rather than raw API outputs, better reflecting the evolving nature of real-world environments. Experiments show that even frontier LLMs achieve a success rate below 60%, highlighting major challenges in tool orchestration. Detailed ablations and error analysis further reveal distinct failure modes and inefficiencies in token usage, pointing to concrete directions for advancing current models. LiveMCP-101 sets a rigorous standard for evaluating real-world agent capabilities, advancing toward autonomous AI systems that reliably execute complex tasks through tool use.
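
To make the plan-based evaluation idea concrete, the sketch below scores an agent's tool-call trace against a reference execution plan rather than against raw API outputs. The data structures, step names, and in-order matching rule are illustrative assumptions for this sketch, not the paper's actual scoring procedure.

```python
# Hypothetical sketch: step names, data structures, and the matching rule
# are assumptions for illustration, not the benchmark's implementation.
from dataclasses import dataclass


@dataclass(frozen=True)
class PlanStep:
    tool: str     # MCP tool name, e.g. "web_search"
    purpose: str  # what the step is meant to accomplish


def plan_match_rate(ground_truth: list[PlanStep], agent_trace: list[PlanStep]) -> float:
    """Fraction of ground-truth steps the agent's trace covers, in order.

    Matching on the intended tool of each step (rather than on raw API
    outputs) keeps the score stable even when live results change between
    runs, which is the motivation for plan-based evaluation.
    """
    matched, i = 0, 0
    for gt_step in ground_truth:
        # Scan forward through the agent trace for a step using the same tool.
        while i < len(agent_trace) and agent_trace[i].tool != gt_step.tool:
            i += 1
        if i < len(agent_trace):
            matched += 1
            i += 1
    return matched / len(ground_truth) if ground_truth else 1.0


if __name__ == "__main__":
    reference = [
        PlanStep("web_search", "find the current exchange rate"),
        PlanStep("calculator", "convert the budget to USD"),
        PlanStep("file_write", "save the final report"),
    ]
    trace = [
        PlanStep("web_search", "look up exchange rate"),
        PlanStep("file_write", "save the report"),
    ]
    print(f"plan match rate: {plan_match_rate(reference, trace):.2f}")  # 0.67
```

In this toy run the agent skips the calculation step, so it covers two of the three reference steps; a stricter variant could also require the step purposes or arguments to align.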