Details
- Scale AI introduced the MCP-Atlas leaderboard, which benchmarks AI models on their ability to use external tools via the Model Context Protocol (MCP); the evaluation spans 1,000 tasks drawing on more than 300 tools across 40 servers (see the protocol sketch after this list).
- The leaderboard assessed 11 advanced AI systems on complex workflows involving three to six tool calls per task, with GPT-5 achieving the highest success rate at 44.5% and the median model (Kimi-K2) at 23.9%.
- MCP-Atlas distinguishes itself from past benchmarks through realistic simulations that incorporate distractors and conditional logic, with tasks such as cross-company IPO analyses designed to mirror workplace scenarios.
- Failure analysis attributed 34–52% of model errors to tool-use difficulties, primarily in tool discovery, parameter selection, and workflow coordination, and another 23–38% to insufficient task comprehension.
- The evaluation draws on Scale's proprietary datasets and multiple online search engines, and orchestrates tasks across servers, closely reflecting the operational landscape of enterprise AI deployments.
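
For readers unfamiliar with MCP, here is a minimal sketch (in Python, using the official `mcp` client SDK) of what one of these multi-step tool workflows looks like from the orchestration side. The server script and tool names (`finance_server.py`, `search_filings`, `get_financials`) are hypothetical stand-ins, not part of MCP-Atlas; the `list_tools` and `call_tool` round-trips are the protocol operations the benchmark exercises.

```python
import asyncio

from mcp import ClientSession, StdioServerParameters
from mcp.client.stdio import stdio_client

# Hypothetical MCP server exposing finance tools; MCP-Atlas tasks span
# dozens of servers, but one is enough to show the call pattern.
SERVER = StdioServerParameters(command="python", args=["finance_server.py"])


async def run_task() -> None:
    async with stdio_client(SERVER) as (read, write):
        async with ClientSession(read, write) as session:
            await session.initialize()

            # Tool discovery: the agent must pick the right tools from the
            # catalog (one of the failure modes cited above).
            catalog = await session.list_tools()
            print("available tools:", [t.name for t in catalog.tools])

            # Step 1 -- parameter selection: passing the wrong arguments
            # here is another frequent failure mode.
            filings = await session.call_tool(
                "search_filings",
                arguments={"company": "ExampleCorp", "form": "S-1"},
            )

            # Step 2 -- conditional logic: the next call depends on the
            # first result, mirroring the benchmark's multi-step tasks.
            if not filings.isError:
                await session.call_tool(
                    "get_financials", arguments={"company": "ExampleCorp"}
                )


if __name__ == "__main__":
    asyncio.run(run_task())
```

A real MCP-Atlas task chains three to six such calls, often across several servers at once, which is where the discovery and coordination errors described above tend to surface.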
Impact
MCP-Atlas spotlights ongoing limitations in current AI agents, with even top models frequently stumbling on tasks typical of real-world automation. As the industry intensifies its focus on robust enterprise AI applications, these findings underscore the pressing need for improved tool-use proficiency in next-generation models.