Arena.ai Ranks GPT-5.5 (xHigh) Second on Agent Arena Leaderboard


Arena.ai ranks OpenAI's GPT-5.5 (xHigh) second on its Agent Arena leaderboard with a +10.6% net improvement, behind Claude Fable 5 (High). The model ranks first on Praise vs. Complaint (+29.4%), Bash Recovery (+14.1%), and tool hallucination (2.1%). It trails Claude Fable 5 (High) on Confirmed Success (+5.4%) and Steerability (+1.9%). The results span 160,000 real-world agentic tasks evaluated over seven days.

Agent Arena leaderboard showing GPT-5.5 (xHigh) ranked second with a 10.6% net improvement over baseline.
Arena.ai (@arena) on X:

GPT-5.5 (xHigh) ranks #2 on Agent Arena (+10.6% net improvement), making it the highest-ranked OpenAI model closely behind Claude Fable 5 (High).

Per signal breakdown, GPT-5.5 (xHigh) ranks #1 in Praise vs. Complaint (+29.4%) and Bash Recovery (+14.1%), scoring higher than Claude Fable 5 (High) on both signals. It trails Claude Fable 5 (High) on Confirmed Success (+5.4% vs. +17.6%) and Steerability (+1.9% vs. +5.4%).

Agent Arena evaluates models on millions of real-world, long-horizon agentic tasks. Models use tools like web search, filesystem, and terminal to complete complex workflows: writing code, creating slide decks, researching the web, building apps, and analyzing documents. We use causal tracing to measure model performance across real-world agentic tasks.

More breakdown of GPT-5.5 (xHigh) across five signals in the thread.


