Artificial Analysis Updates Coding Agent Index with Contamination-Proof DeepSWE Benchmark

Artificial Analysis

Jun 15, 2026

Artificial Analysis updated its Coding Agent Index to v1.1, replacing the gameable SWE-Bench Pro with Datacurve's DeepSWE. DeepSWE tasks are written from scratch to prevent training data contamination. The refreshed leaderboard ranks Claude Code with Fable 5 at 77, followed by Codex with GPT-5.5 at 76 and Claude Code with Opus 4.8 at 73.

Artificial Analysis Coding Agent Index
Composite average pass@1 across DeepSWE, Terminal-Bench v2, and SWE-Atlas-QnA (higher is better)
Source: Artificial Analysis

Score  Agent        Model (setting)
77     Claude Code  Fable 5 (max)
76     Codex        GPT-5.5 (xhigh)
73     Claude Code  Opus 4.8 (max)
71     Codex        GPT-5.5 (medium)
67     Claude Code  Opus 4.8 (medium)
64     Opencode     Opus 4.7 (medium)
62     Cursor CLI   GPT-5.5 (medium)
60     Cursor CLI   Opus 4.7 (medium)
57     Claude Code  Opus 4.7 (medium)
52     Claude Code  GLM-5.1
52     Cursor CLI   Composer 2.5 Fast
47     Claude Code  DeepSeek V4 Pro (high)
47     Claude Code  Kimi K2.6

Figure: The Artificial Analysis Coding Agent Index ranks model performance across the DeepSWE, Terminal-Bench v2, and SWE-Atlas-QnA benchmarks.
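
The chart describes the index as a composite average of pass@1 across three benchmarks. A minimal Python sketch of that arithmetic follows. It assumes equal weighting of the three benchmarks, which the source does not confirm, and the function names and example scores are hypothetical illustrations rather than Artificial Analysis data or code.

```python
from statistics import mean

# Benchmarks named in the index description.
BENCHMARKS = ("DeepSWE", "Terminal-Bench v2", "SWE-Atlas-QnA")


def pass_at_1(results: list[bool]) -> float:
    """Percentage of tasks solved on the first attempt."""
    if not results:
        raise ValueError("no task results")
    return 100.0 * sum(results) / len(results)


def composite_index(scores: dict[str, float]) -> float:
    """Unweighted mean of per-benchmark pass@1.

    Equal weighting is an assumption; the source only says "composite average".
    """
    missing = [b for b in BENCHMARKS if b not in scores]
    if missing:
        raise KeyError(f"missing benchmark scores: {missing}")
    return mean(scores[b] for b in BENCHMARKS)


# Hypothetical per-benchmark scores, not taken from the leaderboard.
example = {"DeepSWE": 70.0, "Terminal-Bench v2": 80.0, "SWE-Atlas-QnA": 75.0}
print(round(composite_index(example)))
```

Under this reading, replacing one component benchmark changes one of the three inputs to the average, which is how a single benchmark swap can reorder agents whose other scores are unchanged.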

View the full update on artificialanalysis.ai

Artificial Analysis (@ArtificialAnlys), 2d ago

We've updated the Artificial Analysis Coding Agent Index, replacing SWE-Bench Pro with Datacurve's DeepSWE benchmark - the swap lifts Codex with GPT-5.5 (xhigh) above Claude Code with Opus 4.8 (max), while the newly released Claude Fable 5 (max) in Claude Code debuts at the top.

DeepSWE, built by @datacurve, writes its tasks from scratch rather than adapting them from public GitHub issues or pull requests, so no model has seen the solutions during training. That matters because SWE-Bench Pro, the benchmark it replaces in our Coding Agent Index, had grown gameable, with some models recovering the fix from the repository's commit history instead of solving the task. The swap reorders the index: Codex with GPT-5.5 (xhigh) rises from 65 to 76, overtaking Claude Code with Opus 4.8 (max) at 73. Claude Code with Fable 5 (max), which enters directly on the refreshed index, leads at 77. SWE-Bench Pro had been flattering some combinations and penalizing others. More below.

