We're sharing a new method for scoring models on agentic coding tasks. Here's how models in Cursor compare on intelligence and efficiency: https://t.co/VItnifMh55
Cursor Publishes CursorBench, Its Internal Agentic Coding Evaluation Methodology
OpenAI· Updated
Cursor published CursorBench, its internal eval suite that scores models on real coding agent tasks from actual developer sessions. Public benchmarks struggle to differentiate frontier models reliably — CursorBench produces more separation where it matters most.
CursorBench-3 scores models on correctness plotted against median completion tokens, capturing the compute-latency tradeoff. Results show GPT-5.4 and GPT-5.3 Codex at the top (~63% and ~60%), with Opus 4.6 at ~57% and Haiku 4.5 at ~28%.Public benchmarks like SWE-bench Verified suffer from training data contamination, poor task alignment, and frontier-level saturation — Haiku can match GPT-5 on those evals. CursorBench's task complexity has roughly doubled since launch, now including multi-workspace monorepos and production log investigation. Cursor supplements the offline suite with online controlled experiments on live traffic, catching regressions that graders miss.
Use the published results to inform which models you enable in your coding agent setup — the score-to-token chart makes the quality-versus-cost tradeoff visible across tested models.
Every HeadsUpAI update is written based on its original source and reviewed before it's published. Read our editorial standards →


