Introducing Agent Arena: real-world agentic evals at scale.

How do you evaluate agents doing actual work? We measure millions of live sessions where real users accomplish real tasks.

On Arena, models now get web search, filesystem, and terminal tools to complete complex workflows: writing code, creating slide decks, researching the web, building apps, and analyzing documents.

Every session produces rich signals. Users iterate with the agent turn-by-turn: approving, editing, correcting, praising, or expressing frustration. The environment gives feedback too: shell errors, tool failures, recovery attempts, and more.

Our leaderboard measures each model's agentic performance using causal inference across five signals: task success, steerability, error recovery, user praise vs. complaint, and tool hallucination.

This leaderboard snapshot is built from 300K+ tasks, 2M+ tool calls, and 40M lines of code written by agents.

Top labs in Agent Arena:
- #1 @OpenAI: GPT-5.5 (High)
- #2 @AnthropicAI: Claude-Opus-4.7 (Thinking)
- #3 @Zai_org: GLM-5.1
- #4 @GoogleDeepMind: Gemini-3.1-Pro
- #5 @Kimi_Moonshot: Kimi-K2.6

More analysis in the thread, with the full technical blog below.
Arena.ai Launches Agent Arena to Evaluate AI Agents on Real-World Work
Arena · Updated
Arena.ai introduced Agent Arena, a new leaderboard that evaluates AI models on complex, real-world agentic tasks using web search, filesystem, and terminal tools. It measures performance across five signals: task success, steerability, error recovery, user praise vs. complaint, and tool hallucination. OpenAI's GPT-5.5 (High) and Anthropic's Claude-Opus-4.7 (Thinking) lead the initial rankings. Because the leaderboard is built from live user sessions, it reflects how agents perform in practical, multi-step workflows.
- Top-ranked model: GPT-5.5 (High)
- Second-ranked model: Claude Opus 4.7 (Thinking)
- Evaluation signals: task success, steerability, error recovery, user praise vs. complaint, tool hallucination
- Tasks analyzed: 160,480
- Tool calls logged: 2M+
- Lines of code written by agents: 40.3M
The evaluation applies causal inference to millions of live user sessions to capture how agents behave, not only whether they finish a task. Alongside task success, it scores steerability (whether the agent carries out user corrections), error recovery, user praise versus complaints, and tool hallucination. These signals matter more as agents take on complex, multi-step workflows.
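To make the five signals concrete, here is a minimal Python sketch that tallies them per model from logged sessions. It is an illustration only: the `Session` fields, the `summarize` function, and the way each rate is defined are assumptions, not Arena's schema or scoring. In particular, this is plain aggregation and does not reproduce the causal-inference step, whose details the announcement does not describe.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Session:
    """One logged session. All field names are hypothetical."""
    model: str
    task_succeeded: bool       # task success
    corrections_given: int     # corrections the user issued
    corrections_followed: int  # corrections the agent carried out (steerability)
    errors_hit: int            # shell errors and tool failures encountered
    errors_recovered: int      # errors the agent recovered from
    praise: int                # user praise events
    complaints: int            # user complaint events
    tool_calls: int            # total tool calls made
    hallucinated_calls: int    # calls flagged as hallucinated (flagging rule assumed)


def ratio(numerator, denominator):
    """Return numerator / denominator, or None when there is nothing to divide by."""
    return numerator / denominator if denominator else None


def summarize(sessions):
    """Compute naive per-model rates for the five signals (no causal adjustment)."""
    totals = defaultdict(lambda: defaultdict(int))
    for s in sessions:
        t = totals[s.model]
        t["sessions"] += 1
        t["successes"] += int(s.task_succeeded)
        t["corrections_given"] += s.corrections_given
        t["corrections_followed"] += s.corrections_followed
        t["errors_hit"] += s.errors_hit
        t["errors_recovered"] += s.errors_recovered
        t["praise"] += s.praise
        t["complaints"] += s.complaints
        t["tool_calls"] += s.tool_calls
        t["hallucinated_calls"] += s.hallucinated_calls

    return {
        model: {
            "task_success": ratio(t["successes"], t["sessions"]),
            "steerability": ratio(t["corrections_followed"], t["corrections_given"]),
            "error_recovery": ratio(t["errors_recovered"], t["errors_hit"]),
            "praise_share": ratio(t["praise"], t["praise"] + t["complaints"]),
            "tool_hallucination": ratio(t["hallucinated_calls"], t["tool_calls"]),
        }
        for model, t in totals.items()
    }
```

Raw rates like these can mislead when models receive different mixes of tasks or users, which is presumably why the leaderboard relies on causal inference instead of simple averages.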
The leaderboard shows how models perform across practical applications, from coding to research. Built from 160,480 tasks, 2M+ tool calls, and 40.3M lines of code written by agents, it tracks the evolving capabilities and trade-offs of frontier models. It extends Arena.ai's work on specialized evaluations, following its Task Specific Leaderboards for coding.

