Cursor Publishes CursorBench, Its Internal Agentic Coding Evaluation Methodology

OpenAI

Mar 15, 2026 · Updated Apr 25, 2026

Cursor published the methodology behind CursorBench, its internal eval suite that scores models on real coding agent tasks drawn from actual developer sessions. Public benchmarks struggle to differentiate frontier models reliably; according to Cursor, CursorBench produces more separation among them.

Cursor, the AI-powered IDE, published CursorBench, an internal evaluation suite built from real developer sessions via Cursor Blame, which traces committed code back to the original agent request. CursorBench-3 plots each model's correctness score against its median completion tokens, so the chart shows both quality and the compute and latency cost of reaching it. Results show GPT-5.4 and GPT-5.3 Codex at the top (~63% and ~60%), with Opus 4.6 at ~57% and Haiku 4.5 at ~28%.
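As a rough illustration of the two axes described above, the sketch below aggregates per-task results into a correctness rate and a median completion-token count for each model. It is only a minimal sketch: the record layout, the `summarize` helper, the model names, and every number are hypothetical and are not taken from Cursor's data or tooling.

```python
from collections import defaultdict
from statistics import median


def summarize(runs):
    """Aggregate (model, passed, completion_tokens) records per model.

    Returns {model: (correctness_rate, median_completion_tokens)}.
    The record layout is an assumption made for this example.
    """
    by_model = defaultdict(list)
    for model, passed, tokens in runs:
        by_model[model].append((passed, tokens))

    summary = {}
    for model, results in by_model.items():
        # Fraction of tasks the grader marked as passed.
        correctness = sum(1 for passed, _ in results if passed) / len(results)
        # Median, not mean, so a few very long sessions do not dominate.
        med_tokens = median(tokens for _, tokens in results)
        summary[model] = (correctness, med_tokens)
    return summary


# Hypothetical records: (model, passed, completion_tokens)
runs = [
    ("model-a", True, 12000),
    ("model-a", False, 18000),
    ("model-a", True, 15000),
    ("model-b", True, 4000),
    ("model-b", False, 5000),
    ("model-b", False, 6000),
]

summary = summarize(runs)
for model, (correctness, med_tokens) in sorted(summary.items()):
    print(f"{model}: correctness={correctness:.2f}, median_tokens={med_tokens}")
```

Each resulting pair is one point on a score-versus-tokens chart: higher is better on the first value, lower is cheaper on the second.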

Public benchmarks like SWE-bench Verified suffer from training data contamination, poor task alignment, and saturation at the frontier: Haiku can match GPT-5 on those evals. CursorBench's task complexity has roughly doubled since launch, and the suite now includes multi-workspace monorepos and production log investigation. Cursor supplements the offline suite with online controlled experiments on live traffic, catching regressions that the offline graders miss.

Use the published results to inform which models you enable in your coding agent setup — the score-to-token chart makes the quality-versus-cost tradeoff visible across tested models.
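One way to read a quality-versus-cost chart like this is to keep only the models that no other model beats on both axes at once (a Pareto frontier). The sketch below shows that filter under stated assumptions: the `pareto_frontier` helper, the model names, and all scores and token counts are hypothetical placeholders, not Cursor's published figures.

```python
def pareto_frontier(points):
    """Return the names of models that no other model dominates.

    points: {name: (score, median_tokens)}; higher score is better,
    fewer tokens is cheaper. A model is dominated when another model is
    at least as good on both axes and strictly better on one of them.
    """
    frontier = []
    for name, (score, tokens) in points.items():
        dominated = any(
            s >= score and t <= tokens and (s > score or t < tokens)
            for other, (s, t) in points.items()
            if other != name
        )
        if not dominated:
            frontier.append(name)
    # Order from cheapest to most expensive for easier reading.
    return sorted(frontier, key=lambda n: points[n][1])


# Hypothetical (score, median completion tokens) pairs.
points = {
    "model-a": (0.62, 20000),
    "model-b": (0.58, 12000),
    "model-c": (0.55, 15000),  # worse and costlier than model-b
    "model-d": (0.30, 3000),
}

frontier = pareto_frontier(points)
print(frontier)  # ['model-d', 'model-b', 'model-a']
```

Models on the frontier are the only ones worth weighing against each other; which point to pick depends on how much extra token spend a given gain in correctness is worth for your workload.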

View the full update on cursor.com

Cursor

@cursor_ai · Mar 12

We're sharing a new method for scoring models on agentic coding tasks. Here's how models in Cursor compare on intelligence and efficiency: https://t.co/VItnifMh55

Keep reading

Cursor Releases Composer 2 Technical Report on Coding Agent Training

Cursor published a technical report on Composer 2, a coding agent trained via pretraining on Kimi K2.5 and RL on real engineering tasks. It scores 61.3 on CursorBench — 37% above Composer 1.5 — matching frontier models at lower cost.

OpenAI · Mar 5

Cursor Now Supports GPT-5.4, Its Current Benchmark Leader

Cursor added GPT-5.4 to its AI coding editor, ranking it first on Cursor's internal benchmarks. The model is confident and decisive in tackling messy, ambiguous coding problems, and is strong at parallelizing work across long agent sessions.

Artificial Analysis · May 31

Artificial Analysis Launches Coding Agent Index to Benchmark Performance and Cost

Artificial Analysis has released a specialized benchmarking suite and index for autonomous coding agents. The initial data identifies Claude Code as the performance leader while highlighting Cursor’s Composer 2.5 as a top-tier option for cost-efficiency.