Claude Opus 4.8 takes top spot on agentic work benchmark

Artificial Analysis

Jun 1, 2026 · Updated Jun 12, 2026

Anthropic's Claude Opus 4.8 has claimed the lead on the GDPval-AA leaderboard for agentic professional tasks. The model achieved an Elo rating of 1890, which implies a win rate of roughly 67% against GPT-5.5 xhigh on real-world work tasks. The result sets a new high mark on the benchmark for AI agents that produce complex office deliverables.

Artificial Analysis evaluated Claude Opus 4.8 using its GDPval-AA benchmark, a framework testing models on economically valuable professional tasks. Running at its 'max' effort setting, the model achieved an Elo rating (a relative skill ranking) of 1890, taking the top spot. That is 137 points above Claude Opus 4.7 and 121 points ahead of the next-best model, GPT-5.5 xhigh.

GDPval-AA Elo Score: 1890
Implied Win Rate vs GPT-5.5 xhigh: ~67%
Output Token Reduction: 35% vs Opus 4.7
Turn Efficiency Gain: 15% vs Opus 4.7
Benchmark Scope: 44 occupations across 9 industries

The results put Claude Opus 4.8 at the current frontier for agentic AI (systems that autonomously execute multi-step goals). While Claude Opus 4.8 uses about 30% more turns than OpenAI's GPT-5.5, its Elo lead implies a win rate of roughly 67% against GPT-5.5 xhigh. Anthropic's flagship favors successful completion of complex deliverables over raw speed or brevity.

These results suggest that Claude Opus 4.8 is currently a strong choice for long-horizon work. It also uses 35% fewer output tokens than its predecessor. Together, these improvements make it a primary candidate for enterprise workflows.
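As a rough check on the win-rate figure, the standard Elo expected-score formula can be applied to the 121-point gap that Artificial Analysis reports between Opus 4.8 and GPT-5.5 xhigh. The sketch below assumes the conventional 400-point logistic scale; the exact rating model behind GDPval-AA is not described here, and the function name is purely illustrative.

```python
def elo_expected_win_rate(rating_gap: float, scale: float = 400.0) -> float:
    """Expected win rate of the higher-rated side under the standard Elo model.

    Assumption: the conventional logistic curve with a 400-point scale.
    GDPval-AA may use a different fitting procedure.
    """
    return 1.0 / (1.0 + 10.0 ** (-rating_gap / scale))


# Gap reported in the source post: Opus 4.8 leads GPT-5.5 xhigh by 121 points.
gap_vs_gpt55_xhigh = 121
print(f"{elo_expected_win_rate(gap_vs_gpt55_xhigh):.1%}")  # about 67%
```

Under that assumption, a 121-point lead works out to an expected win rate of about 67%, which matches the figure Artificial Analysis describes as implied rather than directly measured.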

View the full update on artificialanalysis.ai

Artificial Analysis

@ArtificialAnlys · May 28

Anthropic just launched Claude Opus 4.8, and it is the new leader on our GDPval-AA benchmark for agentic real-world work tasks. Opus 4.8 scored 1890 on GDPval-AA at launch with its 'max' effort setting, +137 points from Opus 4.7 and +121 points ahead of the next-best model, GPT-5.5 xhigh. Compared head-to-head on the GDPval task set, this implies a ~67% win rate against GPT-5.5 xhigh. @AnthropicAI shared access with us ahead of the public release to benchmark this model and we’re glad to see our benchmarks referenced in today’s launch. The rest of the Artificial Analysis Intelligence Index is in progress - we’ll share final results soon!

