HeadsUpAI

Claude Opus 4.8 takes top spot on agentic work benchmark

Artificial Analysis evaluated Claude Opus 4.8 using its GDPval-AA benchmark, a framework testing models on economically valuable professional tasks. The model achieved an Elo rating (a relative skill ranking) of 1890, taking the top spot. This is a 137-point increase over Claude Opus 4.7, which previously led similar agentic evaluations.
GDPval-AA Elo Score: 1890
Win Rate vs GPT-5.5 xhigh: 67%
Output Token Reduction: 35% vs Opus 4.7
Turn Efficiency Gain: 15% vs Opus 4.7
Benchmark Scope: 44 occupations across 9 industries

The results establish a new frontier for agentic AI (systems that autonomously execute multi-step goals). While OpenAI's GPT-5.5 requires 30% fewer turns to finish tasks, Claude Opus 4.8 maintains a 67% win rate against its rival. Anthropic's flagship prioritizes successful completion of complex deliverables over raw speed or brevity.
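The 67% figure follows from the Elo gap. As a rough sketch, assuming GDPval-AA uses the standard logistic Elo model with a 400-point scale (the article does not state this, and the function name below is illustrative), the 121-point lead over GPT-5.5 xhigh reported by Artificial Analysis converts to an expected win rate like this:

```python
def elo_win_probability(rating_a: float, rating_b: float, scale: float = 400.0) -> float:
    """Expected win rate of A over B under the standard logistic Elo model.

    Assumption: the benchmark follows the conventional 400-point scale.
    """
    return 1.0 / (1.0 + 10.0 ** ((rating_b - rating_a) / scale))


opus_48 = 1890                # reported GDPval-AA score
gpt_55_xhigh = opus_48 - 121  # derived from the reported 121-point gap

p = elo_win_probability(opus_48, gpt_55_xhigh)
print(f"Expected win rate: {p:.1%}")  # about 67%
```

Under the same assumption, only the difference between two ratings matters, so any pair of models 121 points apart would show the same expected win rate.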

These results indicate that Claude Opus 4.8 is currently the most reliable choice for long-horizon work. It uses 35% fewer output tokens than its predecessor, mirroring gains noted in recent industry analysis. These improvements, alongside high scores in frontend coding tests, make it a primary candidate for enterprise workflows.

Artificial Analysis (@ArtificialAnlys) on X:

Anthropic just launched Claude Opus 4.8, and it is the new leader on our GDPval-AA benchmark for agentic real-world work tasks. Opus 4.8 scored 1890 on GDPval-AA at launch with its 'max' effort setting, +137 points from Opus 4.7 and +121 points ahead of the next-best model, GPT-5.5 xhigh. Compared head-to-head on the GDPval task set, this implies a ~67% win rate against GPT-5.5 xhigh. @AnthropicAI shared access with us ahead of the public release to benchmark this model and we’re glad to see our benchmarks referenced in today’s launch. The rest of the Artificial Analysis Intelligence Index is in progress - we’ll share final results soon!
