Details
- Scale AI has introduced SWE-Bench Pro, a benchmark featuring 1,865 coding tasks to evaluate AI coding agents on unseen code drawn from GPL-licensed and private commercial repositories.
- SWE-Bench Pro addresses major shortcomings of current benchmarks, such as data contamination, limited task diversity, and problems too simple or artificial to reflect day-to-day software engineering work.
- Instead of filtering out complex or ambiguous scenarios, the benchmark retains underspecified tasks that require substantial changes, averaging more than 100 modified lines of code across multiple files, and supplements them with human-edited problem statements.
- State-of-the-art models like GPT-5 and Claude Opus 4.1 scored only about 23% on SWE-Bench Pro, a stark drop from the 70%-plus scores they achieved on the earlier SWE-Bench Verified benchmark.
- Results varied by language and codebase: models performed best on Go and Python tasks, markedly worse on JavaScript and TypeScript, and worst of all on private commercial code.
Impact
SWE-Bench Pro reveals a striking gap between AI coding performance on established benchmarks and the complexity of real-world software engineering. As enterprises evaluate generative AI for programming, these findings underscore the need for rigorous testing on proprietary code and suggest that current models may fall short outside idealized assessments.