EVMbench Tests AI Agents on Smart Contract Security Detection and Exploitation

OpenAI

Feb 18, 2026 · Updated Apr 25, 2026

OpenAI and Paradigm released EVMbench, a benchmark that tests AI agents on three smart contract security tasks: detecting vulnerabilities, patching them, and executing exploits. GPT-5.3-Codex scores 72.2% in Exploit mode, more than double the 31.9% scored by GPT-5, which was released about six months earlier.

OpenAI and Paradigm, a crypto research firm, released EVMbench, a benchmark that measures AI agents' ability to detect, patch, and exploit smart contract vulnerabilities. Built from 120 curated vulnerabilities across 40 audits, it tests three modes: Detect (audit for vulnerabilities), Patch (fix them while preserving functionality), and Exploit (execute fund-draining attacks in a sandboxed blockchain environment). GPT-5.3-Codex, run via Codex CLI, scores 72.2% in Exploit mode, versus 31.9% for GPT-5, which was released about six months earlier.
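As a quick check on the "more than double" comparison, the short Python sketch below computes the ratio and the percentage-point gap from the two Exploit-mode scores quoted above. The function name `improvement` is purely illustrative and is not part of EVMbench or its tooling.

```python
# Exploit-mode scores as quoted in this article (percent).
GPT_5_3_CODEX = 72.2
GPT_5 = 31.9


def improvement(new: float, old: float) -> tuple[float, float]:
    """Return (ratio, percentage-point gain) between two scores."""
    return new / old, new - old


ratio, gain = improvement(GPT_5_3_CODEX, GPT_5)
print(f"{ratio:.2f}x, +{gain:.1f} percentage points")
```

The ratio comes out a little above 2, which is consistent with the "more than double" wording.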

Agents perform better at exploitation than at detection or patching. In Detect mode, agents tend to stop after finding a single issue rather than auditing exhaustively. In Patch mode, removing a vulnerability while preserving the contract's functionality remains difficult. Smart contracts secure more than $100B in crypto assets, which makes AI-driven auditing a meaningful defensive use case as agent capabilities grow.

OpenAI open-sourced EVMbench's tasks, tooling, and evaluation framework, and is committing $10M in API credits for good-faith cybersecurity research.

OpenAI (@OpenAI) · Feb 18

Introducing EVMbench—a new benchmark that measures how well AI agents can detect, exploit, and patch high-severity smart contract vulnerabilities. https://t.co/op5zufgAGH

Keep reading

Codex Security Launches to Find, Validate, and Patch Code Vulnerabilities

OpenAI launched Codex Security, an AI agent that identifies complex code vulnerabilities, validates them automatically, and proposes targeted fixes. It cuts triage noise significantly, so security teams focus on real threats rather than false positives.

Vercel Releases deepsec to Automate Deep Security Audits With Coding Agents

Vercel · May 4

Vercel open-sourced deepsec, a security harness that uses autonomous coding agents to identify and investigate vulnerabilities in large-scale codebases. Unlike traditional scanners that rely on static patterns, this tool uses high-reasoning models to trace data flows and validate findings through a multi-stage pipeline. By moving security audits into an agentic framework, teams can perform deep reviews that were previously too slow or expensive for manual researchers.

Arena.ai Ranks GPT-5.5 as Top Tier for Search and Coding

Arena · Apr 30

GPT-5.5 entered the Arena.ai leaderboards with a top-two ranking in search and a 50-point performance jump in agentic web development. These community-driven results validate the model's focus on complex tool use and reasoning across vision, math, and document analysis.

OpenAI Prepares Restricted Rollout of GPT-5.5-Cyber for Security Defenders

Sam Altman · May 1

OpenAI is starting the rollout of GPT-5.5-Cyber, a specialized frontier model designed for defensive cybersecurity operations. Access is restricted to vetted institutions and government partners to prevent misuse while arming defenders with advanced AI tools.