Arena.ai Ranks xAI's Grok Build 0.1 Above Grok 4.3 in Agent Arena

Arena

Jun 9, 2026 · Updated Jun 20, 2026

Arena.ai's new Agent Arena leaderboard places xAI's Grok Build 0.1 at #15 and Grok 4.3 (High) at #17. Grok Build 0.1 shows improved bash capability and appears to complete tasks more often overall than Grok 4.3, though it is slightly less steerable and more prone to tool hallucinations.

On the leaderboard, Grok Build 0.1 sits two places above Grok 4.3 (High), at #15 versus #17. It improves meaningfully on Grok 4.3's bash capability and appears to complete tasks more often overall. However, it is slightly less steerable than Grok 4.3 and more prone to tool hallucinations, a term generally understood to mean an agent calling tools that do not exist or misreporting what a tool returned.

Grok Build 0.1 Overall Rank: #15
Grok 4.3 (High) Overall Rank: #17
Grok Build 0.1 Net Improvement: -5.3%
Grok 4.3 (High) Net Improvement: -9.4%
Grok Build 0.1 Bash Recovery Rank: #9 (+6.1%)
Grok 4.3 (High) Bash Recovery Rank: #16 (-3.8%)
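The gaps between the two models can be read off these figures with simple arithmetic. The sketch below is illustrative only: the `ArenaResult` class and `compare` function are hypothetical names, and it assumes the percentage shown beside each Bash Recovery rank is that signal's net improvement and that differences between percentages are best read as percentage points.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ArenaResult:
    """One model's reported figures (hypothetical container, not an Arena.ai type)."""
    model: str
    overall_rank: int
    net_improvement_pct: float
    bash_recovery_rank: int
    bash_recovery_pct: float  # assumed: net improvement on the Bash Recovery signal


# Figures copied from the list above.
GROK_BUILD = ArenaResult("Grok Build 0.1", 15, -5.3, 9, 6.1)
GROK_43 = ArenaResult("Grok 4.3 (High)", 17, -9.4, 16, -3.8)


def compare(newer: ArenaResult, older: ArenaResult) -> dict:
    """Return rank places gained and percentage-point gaps of `newer` over `older`."""
    return {
        # A lower rank number is better, so subtract newer from older.
        "overall_rank_places_gained": older.overall_rank - newer.overall_rank,
        "net_improvement_gap_pts": round(
            newer.net_improvement_pct - older.net_improvement_pct, 1
        ),
        "bash_rank_places_gained": older.bash_recovery_rank - newer.bash_recovery_rank,
        "bash_recovery_gap_pts": round(
            newer.bash_recovery_pct - older.bash_recovery_pct, 1
        ),
    }


gaps = compare(GROK_BUILD, GROK_43)
for name, value in gaps.items():
    print(f"{name}: {value}")
```

Under these assumptions, Grok Build 0.1 is 2 places ahead overall and 7 places ahead on Bash Recovery, with gaps of 4.1 and 9.9 percentage points respectively.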

The Agent Arena evaluates models on real-world agentic tasks, in which an AI autonomously plans and uses tools, using a causal tracing methodology. A model's net improvement indicates how much better or worse it is than the average model, so the negative figures for both Grok models place them below that average overall, while Grok Build 0.1's positive Bash Recovery figure places it above average on that signal. The results point to a trade-off in Grok Build 0.1: stronger execution, particularly in bash, set against weaker steerability and more tool hallucinations.

The Agent Arena leaderboard reports scores across five signals: Confirmed Success, Praise vs. Complaint, Steerability, Bash Recovery, and Tool Hallucination. The full rankings and methodology are available on the Arena.ai website, which launched Agent Arena to evaluate frontier AI. More details on xAI's Grok Build 0.1 model are also available.

View the full update on arena.ai

Arena.ai

@arena · Jun 8

Grok Build 0.1 ranks #15 and Grok 4.3 (High) #17 in the new Agent Arena leaderboard. Grok Build 0.1 improves meaningfully on bash capability over Grok 4.3. It is slightly less steerable and more prone to tool hallucinations, but looks to be successfully completing tasks more often overall. Agent Arena ranks models on real-world agentic tasks using a causal tracing methodology. A model’s net improvement indicates how much better or worse it is than the average model. The thread breaks down how each model from @xAI scored across 5 signals, drawn from real tasks submitted by a global community of users.

Still wondering? A few quick answers below.

What is the Agent Arena leaderboard?

The Agent Arena leaderboard dynamically ranks AI models based on their performance in real-world agentic tasks. It evaluates how well models orchestrate tools, measuring signals like task completion, tool reliability, and steerability using a causal tracing methodology.

How did Grok Build 0.1 perform compared to Grok 4.3 (High)?

Grok Build 0.1 ranked #15 overall, while Grok 4.3 (High) ranked #17. Grok Build 0.1 showed meaningful improvement in bash capability and completed tasks more often. However, it was slightly less steerable and more prone to tool hallucinations than Grok 4.3 (High).

What are the key evaluation signals?

The Agent Arena evaluates models across five key signals: Confirmed Success, Praise vs. Complaint, Steerability, Bash Recovery, and Tool Hallucination. These signals provide a comprehensive assessment of an AI agent's ability to perform complex tasks.

Every HeadsUpAI update is written based on its original source and reviewed before it's published.

Keep reading

Arena.ai Subjects Grok 4.3 to Blind Community Testing for Coding and Vision

Arena.ai added xAI's Grok 4.3 to its blind evaluation leaderboards for text, vision, documents, and frontend coding. This move subjects the new reasoning model to real-world human preference testing to verify its performance against established frontier models.

xAI Grok Build 0.2.7 Syncs Parallel Subagents Through a Shared Terminal Backend

xAI · May 28

xAI released Grok Build v0.2.7, introducing a shared terminal backend and scheduler to improve coordination between parallel subagents. The update also optimizes multimodal processing by converting file-based images into vision tokens and brings the CLI to feature parity for Windows users.

OpenRouter Adds Grok 4.3 With Massive Agentic Performance Jump and Lower Pricing

OpenRouter · May 5

OpenRouter integrated xAI's new Grok-4.3 reasoning model, which features a 1 million token context window and a significant boost in autonomous task performance. The model achieved a 1500 ELO on the GDPval-AA benchmark for economically valuable tasks, surpassing previous flagship models while launching at a lower price point than its predecessor.
