Anthropic just launched Claude Opus 4.8, and it is the new leader on our GDPval-AA benchmark for agentic real-world work tasks. Opus 4.8 scored 1890 on GDPval-AA at launch with its 'max' effort setting, +137 points over Opus 4.7 and +121 points ahead of the next-best model, GPT-5.5 xhigh. Compared head-to-head on the GDPval task set, this implies a ~67% win rate against GPT-5.5 xhigh. @AnthropicAI shared access with us ahead of the public release to benchmark this model and we’re glad to see our benchmarks referenced in today’s launch. The rest of the Artificial Analysis Intelligence Index is in progress - we’ll share final results soon!
Claude Opus 4.8 takes top spot on agentic work benchmark
Artificial Analysis evaluated Claude Opus 4.8 using its GDPval-AA benchmark, a framework testing models on economically valuable professional tasks. The model achieved an Elo rating (a relative skill ranking) of 1890, taking the top spot. This is a 137-point increase over Claude Opus 4.7, which previously led similar agentic evaluations.
- GDPval-AA Elo Score: 1890
- Win Rate vs GPT-5.5 xhigh: 67%
- Output Token Reduction: 35% vs Opus 4.7
- Turn Efficiency Gain: 15% vs Opus 4.7
- Benchmark Scope: 44 occupations across 9 industries
The results set a new high mark for agentic AI (systems that autonomously execute multi-step goals). OpenAI's GPT-5.5 requires 30% fewer turns to finish tasks, yet Claude Opus 4.8 still wins about 67% of head-to-head comparisons against it. This suggests Anthropic's flagship prioritizes successful completion of complex deliverables over raw speed or brevity.
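The 67% figure can be sanity-checked from the 121-point Elo gap reported in the launch post. The sketch below assumes GDPval-AA uses the conventional Elo logistic curve (base 10, scale 400); the benchmark's exact rating model is not stated here, and the function name is illustrative only.

```python
def elo_expected_win_rate(rating_gap: float, scale: float = 400.0) -> float:
    """Expected win rate of the higher-rated model.

    Assumes the standard Elo logistic curve (base 10, scale 400).
    This is an assumption, not a confirmed detail of GDPval-AA.
    """
    return 1.0 / (1.0 + 10.0 ** (-rating_gap / scale))


# Gap between Claude Opus 4.8 (1890) and GPT-5.5 xhigh, as reported: 121 points.
gap_vs_gpt55_xhigh = 121
win_rate = elo_expected_win_rate(gap_vs_gpt55_xhigh)
print(f"Implied win rate: {win_rate:.1%}")
```

Under that assumption, the 121-point gap works out to roughly two wins in three, in line with the reported ~67%.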
These results suggest that Claude Opus 4.8 is currently the strongest choice for long-horizon work. It also uses 35% fewer output tokens than Opus 4.7, mirroring gains noted in recent industry analysis. These improvements, alongside high scores in frontend coding tests, make it a primary candidate for enterprise workflows.
Artificial Analysis
@ArtificialAnlys



