- OpenAI and Paradigm built EVMbench from 120 real audit vulnerabilities.
- Benchmark tests AI in detect, patch, and exploit modes using sandboxed EVM environments.
- GPT-5.3-Codex scored 72.2% in exploit mode, outperforming earlier GPT-5 results.
OpenAI, working with Paradigm, unveiled a new benchmark to test AI performance on Ethereum smart contract security. The release, announced this week, introduced EVMbench as a way to measure how AI agents detect, patch, and exploit contract flaws. The effort targets rising risks, as smart contracts secure over $100 billion in crypto assets across EVM networks.
Benchmark Built From Real-World Audit Failures
According to OpenAI, EVMbench draws from 120 high-severity vulnerabilities identified across 40 professional smart contract audits. Notably, many of these issues originated from open audit competitions, including Code4rena. The benchmark focuses on real bugs rather than synthetic examples.
In addition, OpenAI said the dataset includes scenarios drawn from security work on the Tempo chain, a payment-focused Layer-1 network built for stablecoin transfers. These cases bring payment-logic risks into the benchmark.
To support realistic testing, engineers reused exploit proof-of-concept scripts where available. However, they manually built missing components when documentation proved incomplete. OpenAI said it preserved exploitability while ensuring patches could compile correctly.
Three Testing Modes Stress AI Agents
EVMbench evaluates agents in detect, patch, and exploit modes. In detect mode, agents scan repositories and receive scores based on confirmed vulnerability recall. In patch mode, agents must fix flaws while preserving original contract behavior.
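OpenAI has not published its grading code, but detect-mode scoring by "confirmed vulnerability recall" amounts to a simple ratio. A minimal illustrative sketch (the vulnerability IDs are hypothetical, and this is not EVMbench's actual grader):

```python
def detect_recall(confirmed_findings: set[str], known_vulns: set[str]) -> float:
    """Fraction of the known vulnerabilities the agent actually found.

    Only confirmed findings that match a known bug count; false positives
    do not raise recall.
    """
    if not known_vulns:
        return 0.0
    return len(confirmed_findings & known_vulns) / len(known_vulns)

# Hypothetical run: the benchmark repo contains four seeded bugs,
# the agent confirms two of them plus one false positive.
known = {"VULN-01", "VULN-02", "VULN-03", "VULN-04"}
found = {"VULN-01", "VULN-03", "FALSE-POSITIVE-9"}

print(detect_recall(found, known))  # 2 of 4 known bugs found -> 0.5
```

Under this metric an agent is rewarded only for matching real, pre-catalogued bugs, which is why patch mode needs the separate behavior-preservation check described above.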
Exploit mode simulates full fund-draining attacks within a sandboxed blockchain. OpenAI said graders confirm outcomes through transaction replay and on-chain state checks. For consistency, the company built a Rust-based harness that makes contract deployments deterministic.
The exploit tests run in a local Anvil environment, not live networks. OpenAI noted that all vulnerabilities are historical and publicly disclosed. Additionally, the harness restricts unsafe RPC calls to reduce misuse.
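EVMbench's Rust harness is not public, but the kind of isolated Anvil environment the article describes can be sketched with standard Foundry tooling. A minimal setup, assuming Foundry is installed; `src/Exploit.sol` is a hypothetical exploit contract, and the account details are anvil's well-known test defaults:

```shell
# Start a deterministic local chain: fixed mnemonic and chain id,
# no connection to any live network.
anvil --chain-id 31337 \
  --mnemonic "test test test test test test test test test test test junk" &

# Deploy the (hypothetical) exploit contract from anvil's default
# funded test account.
forge create src/Exploit.sol:Exploit \
  --rpc-url http://127.0.0.1:8545 \
  --private-key 0xac0974bec39a17e36ba4a6b4d238ff944bacb478cbed5efcae784d7bf4f2ff80

# Confirm the outcome via on-chain state, e.g. the attacker balance
# after the drain transaction.
cast balance 0xf39Fd6e51aad88F6F4ce6aB8827279cffFb92266 \
  --rpc-url http://127.0.0.1:8545
```

Because the chain is local and seeded from a fixed mnemonic, every run deploys to the same addresses, which is the property a deterministic grading harness needs for transaction replay.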
Results and Team Expansion
In reported results, GPT-5.3-Codex achieved a 72.2% score in exploit mode, while GPT-5, released months earlier, reached 31.9%. OpenAI cautioned, however, that detection and patch coverage remain incomplete.
Alongside EVMbench, OpenAI confirmed a key hire. Peter Steinberger, founder of OpenClaw, joined the company to work on agent development. Sam Altman confirmed the move on X, noting Steinberger will lead next-generation personal agent projects.
