Arena.ai Adds Claude Opus 4.8 to Agent Arena Leaderboard

Arena

Jun 15, 2026

Arena.ai added Anthropic's Claude Opus 4.8 to its Agent Arena leaderboard. With thinking enabled, the model ties for first place with GPT-5.5 (High) at a 9.1% net improvement over the average model, and it completes more tasks than Opus 4.7. However, the thinking variant regresses on steerability, bash error recovery, and tool hallucination, while the non-thinking variant ranks #8 at +4.3% and logs one of the highest tool hallucination rates on the platform.

Agent Arena Leaderboard
Claude Opus 4.8
Tied for #1 and Ranked #8
1 GPT-5.5 (High) +9.1%
2 Claude Opus 4.8 (Thinking) +9.1%
3 Claude Opus 4.7 (Thinking) +8.4%
4 Claude Opus 4.6 +8.2%
5 GPT-5.4 (High) +8.0%
6 GPT-5.5 +7.9%
7 Claude Opus 4.7 +7.3%
8 Claude Opus 4.8 +4.3%
9 Claude Sonnet 4.6 +3.9%
10 GLM-5.1 +1.9%
11 DeepSeek-V4 Pro -0.1%
12 Gemini-3.5 Flash -0.3%
13 Kimi-K2.6 -0.6%
14 Gemini-3.1 Pro -1.4%
15 DeepSeek-V4 Flash -1.5%
16 Qwen-3.6 Plus -4.1%
17 Grok Build 0.1 -5.2%
18 MiniMax-M2.7 -8.7%
19 Grok-4.3 (High) -9.2%
20 Gemini-3 Flash -10.1%
21 Gemma-4 31B -14.3%
22 Grok-4.3 -22.7%

Agent Arena leaderboard showing Claude Opus 4.8 performance metrics relative to baseline across various AI models.
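The version-to-version gaps can be read directly off the leaderboard figures above. The following is a minimal sketch that simply subtracts the published net improvement scores. Treating that difference as a rough comparison between releases is an assumption made here, not something Arena states, and the `scores` dictionary and `delta` helper are illustrative names.

```python
# Net improvement scores (percentage points relative to the average model),
# copied from the leaderboard rows above.
scores = {
    "Claude Opus 4.8 (Thinking)": 9.1,
    "Claude Opus 4.7 (Thinking)": 8.4,
    "Claude Opus 4.8": 4.3,
    "Claude Opus 4.7": 7.3,
}


def delta(new: str, old: str) -> float:
    """Difference in net improvement between two leaderboard entries, in points."""
    # Round to one decimal place to match the precision of the published scores.
    return round(scores[new] - scores[old], 1)


thinking_delta = delta("Claude Opus 4.8 (Thinking)", "Claude Opus 4.7 (Thinking)")
non_thinking_delta = delta("Claude Opus 4.8", "Claude Opus 4.7")

print(f"Thinking: {thinking_delta:+.1f} points")          # Thinking: +0.7 points
print(f"Non-thinking: {non_thinking_delta:+.1f} points")  # Non-thinking: -3.0 points
```

On these numbers, the thinking variant sits 0.7 points above Opus 4.7 (Thinking), while the non-thinking variant sits 3.0 points below the non-thinking Opus 4.7.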


Arena.ai

@arena · 5d ago

Claude Opus 4.8 debuts on Agent Arena tied #1 with GPT 5.5 (High) for Thinking & ranked #8 for Non-Thinking.

The Opus 4.8 models show a small improvement over their predecessor 4.7 specifically when thinking is turned on. With thinking on, it completes more tasks than 4.7, but comes in slightly less steerable and slower to recover from bash errors. This variant also regresses on tool hallucination. With thinking off, it logs one of the highest tool hallucination rates on the leaderboard.

Agent Arena ranks models on real-world agentic tasks using a causal tracing methodology. A model’s net improvement indicates how it compares to the average model.

The thread breaks down how the two Opus 4.8 variants from @AnthropicAI scored across 5 signals, drawn from real tasks submitted by a global community of users.

