HeadsUpAI

EVMbench Tests AI Agents on Smart Contract Security Detection and Exploitation


OpenAI and Paradigm, a crypto research firm, released EVMbench, a benchmark measuring AI agents' ability to detect, patch, and exploit smart contract vulnerabilities. Built from 120 curated vulnerabilities across 40 audits, it tests three modes: Detect (audit for vulnerabilities), Patch (fix them while preserving functionality), and Exploit (execute fund-draining attacks in a sandboxed blockchain environment). GPT-5.3-Codex, run via Codex CLI, scores 72.2% in Exploit mode, compared with 31.9% for GPT-5 six months earlier.

Agents perform better at exploitation than at detection or patching. In Detect mode, agents stop after finding one issue rather than auditing exhaustively. In Patch mode, preserving functionality while removing vulnerabilities remains difficult. Smart contracts secure more than $100B in crypto assets, which makes AI-driven auditing a meaningful defensive use case as agent capabilities grow.

OpenAI open-sourced EVMbench's tasks, tooling, and evaluation framework, and is committing $10M in API credits for good-faith cybersecurity research.

OpenAI (@OpenAI) on X:

Introducing EVMbench—a new benchmark that measures how well AI agents can detect, exploit, and patch high-severity smart contract vulnerabilities. https://t.co/op5zufgAGH
