Details
- Scale AI has introduced SWE-Bench Pro, a benchmark featuring 1,865 coding tasks to evaluate AI coding agents on unseen code drawn from GPL-licensed and private commercial repositories.
- SWE-Bench Pro addresses major shortcomings of current benchmarks, such as data contamination, limited task diversity, and problems too simple or artificial to reflect day-to-day software engineering work.
- Instead of filtering out complex or ambiguous scenarios, the benchmark retains underspecified tasks that require substantial changes, averaging more than 100 modified lines of code across multiple files, and supplements them with human-edited problem statements.
- State-of-the-art models like GPT-5 and Claude Opus 4.1 scored only about 23% on SWE-Bench Pro, a stark drop from the 70%-plus scores they achieved on the earlier SWE-Bench Verified benchmark.
- Results varied by language and codebase: models performed best on Go and Python tasks, markedly worse on JavaScript and TypeScript, and worst of all on private commercial code.
Impact
SWE-Bench Pro reveals a striking gap between AI coding performance on established benchmarks and the complexity of real-world software engineering. As enterprises evaluate generative AI for programming, these findings underscore the need for rigorous testing on proprietary code and suggest that current models may fall short outside idealized assessments.