HeadsUpAI

Cursor Publishes CursorBench, Its Internal Agentic Coding Evaluation Methodology

Cursor, the AI-powered IDE, published CursorBench, an internal evaluation suite built from real developer sessions via Cursor Blame, which traces committed code back to the original agent request. CursorBench-3 plots each model's correctness score against its median completion tokens, which exposes the tradeoff between quality and compute or latency. Results show GPT-5.4 and GPT-5.3 Codex at the top (~63% and ~60%), with Opus 4.6 at ~57% and Haiku 4.5 at ~28%.
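To make the two chart axes concrete, here is a minimal sketch of how a set of per-task runs could be reduced to one (correctness, median completion tokens) point per model. It is an illustration only, not Cursor's grading code: the `summarize` function, the run tuple layout, and the sample numbers are all hypothetical.

```python
from statistics import median


def summarize(runs):
    """Reduce one model's runs to (correctness, median completion tokens).

    `runs` is a list of (passed, completion_tokens) tuples. This layout is an
    assumption made for the sketch, not a published CursorBench format.
    """
    if not runs:
        raise ValueError("no runs to summarize")
    # Correctness: fraction of tasks the grader marked as passed.
    correctness = sum(1 for passed, _ in runs if passed) / len(runs)
    # Median rather than mean, so a few very long runs do not dominate.
    median_tokens = median(tokens for _, tokens in runs)
    return correctness, median_tokens


# Synthetic placeholder data, NOT CursorBench results.
example_runs = [(True, 1200), (False, 3400), (True, 800), (True, 2100)]
score, tokens = summarize(example_runs)
print(f"correctness={score:.2f}, median_tokens={tokens}")
```

Using the median keeps the token axis stable when a handful of tasks produce unusually long completions.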

Public benchmarks like SWE-bench Verified suffer from training data contamination, poor task alignment, and saturation at the frontier: Haiku can match GPT-5 on those evals. CursorBench's task complexity has roughly doubled since launch, and tasks now include multi-workspace monorepos and production log investigation. Cursor supplements the offline suite with online controlled experiments on live traffic, which catch regressions that the offline graders miss.

Use the published results to inform which models you enable in your coding agent setup. The score-to-token chart makes the quality-versus-cost tradeoff visible across the tested models.
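One way to read a score-versus-tokens chart is to keep only the models that no other model beats on both axes. The sketch below shows that filter under stated assumptions: the `pareto_frontier` helper, the model names, and every number are hypothetical placeholders, and real values would have to be read off Cursor's published chart.

```python
def pareto_frontier(points):
    """Return the names of models that no other model dominates.

    `points` maps a model name to (score, median_tokens). A model is dominated
    when another model scores at least as high with at most as many tokens,
    and is strictly better on at least one of the two.
    """
    frontier = []
    for name, (score, tokens) in points.items():
        dominated = any(
            other != name
            and o_score >= score
            and o_tokens <= tokens
            and (o_score > score or o_tokens < tokens)
            for other, (o_score, o_tokens) in points.items()
        )
        if not dominated:
            frontier.append(name)
    return frontier


# Hypothetical models and numbers, NOT taken from CursorBench.
candidates = {
    "model_a": (0.60, 5000),
    "model_b": (0.55, 6000),  # lower score and more tokens than model_a
    "model_c": (0.30, 1000),  # weakest score, but cheapest
}
shortlist = pareto_frontier(candidates)
print(shortlist)
```

Models left on the shortlist each represent a different point on the tradeoff, so the final choice still depends on how much latency and cost a given workflow can tolerate.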

Cursor
@cursor_ai
X

We're sharing a new method for scoring models on agentic coding tasks. Here's how models in Cursor compare on intelligence and efficiency: https://t.co/VItnifMh55

225 retweets