Details
- OpenAI introduced EVMbench, a new benchmark developed with Paradigm to evaluate AI agents' ability to detect, exploit, and patch high-severity vulnerabilities in Ethereum Virtual Machine (EVM) smart contracts.
- The benchmark draws from 120 real vulnerabilities curated from 40 professional audits, primarily open competitions like Code4rena, plus scenarios from Tempo blockchain security work.
- EVMbench tests three modes: detect (auditing for vulnerabilities scored on recall), patch (fixing issues while preserving functionality), and exploit (executing fund-draining attacks in a sandboxed local Ethereum environment).
- In evaluations, OpenAI's GPT-5.3-Codex scored 72.2% in exploit mode, a major leap from GPT-5's 31.9%, though detect and patch modes showed gaps in comprehensive coverage.
- OpenAI is committing $10 million in API credits for cyber defense, expanding Aardvark security agent beta, and partnering on free codebase scanning for open-source projects.
- The framework uses Rust-based re-execution for reproducible grading and is fully open-sourced with code, tasks, and tooling.
Impact
OpenAI's EVMbench arrives as AI coding prowess accelerates, directly addressing risks to the $100 billion-plus in assets locked in open-source smart contracts, where exploits could drain funds irreversibly due to blockchain immutability. By benchmarking end-to-end capabilities like exploiting live-like chains, it pressures rivals such as Anthropic and Google DeepMind to prioritize agentic security evaluations, especially after recent advances in their coding models; GPT-5.3-Codex's 72.2% exploit score outpaces prior OpenAI versions and highlights how frontier agents now handle complex, multi-step attacks better than humans in isolated tasks. This convergence of AI and crypto lowers barriers for defensive auditing—via $10M credits and tools like Aardvark—potentially widening access for smaller protocols and shifting market dynamics toward AI-assisted security, reducing reliance on costly human audits from firms like Code4rena. Over 12-24 months, it could steer R&D toward comprehensive coverage benchmarks, influencing funding flows into secure agent frameworks amid rising regulatory scrutiny on crypto vulnerabilities and AI safety.
