Grok Build 0.1 ranks #15 and Grok 4.3 (High) #17 in the new Agent Arena leaderboard. Grok Build 0.1 improves meaningfully on bash capability over Grok 4.3. It is slightly less steerable and more prone to tool hallucinations, but looks to be successfully completing tasks more often overall. Agent Arena ranks models on real-world agentic tasks using a causal tracing methodology. A model’s net improvement indicates how much better or worse it is than the average model. The thread breaks down how each model from @xAI scored across 5 signals, drawn from real tasks submitted by a global community of users.
Arena.ai Ranks xAI's Grok Build 0.1 Above Grok 4.3 in Agent Arena
Arena.ai's new Agent Arena leaderboard places xAI's Grok Build 0.1 at #15 and Grok 4.3 (High) at #17. Grok Build 0.1 demonstrates improved bash capability and looks to be successfully completing tasks more often overall than Grok 4.3, though it is slightly less steerable and more prone to tool hallucinations.
- Grok Build 0.1 Overall Rank: #15
- Grok 4.3 (High) Overall Rank: #17
- Grok Build 0.1 Net Improvement: -5.3%
- Grok 4.3 (High) Net Improvement: -9.4%
- Grok Build 0.1 Bash Recovery Rank: #9 (+6.1%)
- Grok 4.3 (High) Bash Recovery Rank: #16 (-3.8%)
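To make the gap between the two models concrete, the reported figures can be compared directly. This is only a minimal sketch: it assumes the percentages are measured relative to the average model (as the source describes for net improvement) and that they can be subtracted as percentage points, which the source does not state. The dictionary keys and the `gap` helper are hypothetical names used for illustration.

```python
# Figures as reported in the list above.
# Assumption: each percentage is relative to the average model.
reported = {
    "Grok Build 0.1": {"overall_rank": 15, "net_improvement": -5.3, "bash_recovery": 6.1},
    "Grok 4.3 (High)": {"overall_rank": 17, "net_improvement": -9.4, "bash_recovery": -3.8},
}


def gap(metric: str) -> float:
    """Return Grok Build 0.1 minus Grok 4.3 (High) for one metric."""
    newer = reported["Grok Build 0.1"][metric]
    older = reported["Grok 4.3 (High)"][metric]
    # Round to one decimal to match the precision of the reported figures.
    return round(newer - older, 1)


print(gap("net_improvement"))  # 4.1
print(gap("bash_recovery"))    # 9.9
```

Under those assumptions, Grok Build 0.1 sits 4.1 points closer to the average model than Grok 4.3 (High) overall, and 9.9 points ahead on bash recovery.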
The Agent Arena evaluates models on real-world agentic tasks, in which an AI autonomously plans and uses tools, using a causal tracing methodology. This offers a detailed view of model performance in complex workflows, revealing strengths and weaknesses across signals. The results point to a trade-off in Grok Build 0.1: stronger bash recovery and more frequent task completion than Grok 4.3, set against slightly lower steerability and more tool hallucinations.
The Agent Arena leaderboard details scores across five signals: Confirmed Success, Praise vs. Complaint, Steerability, Bash Recovery, and Tool Hallucination. Explore the full rankings and methodology on the Arena.ai website, which launched its Agent Arena to evaluate frontier AI. More details on the Grok Build 0.1 model from xAI are also available.





