Arena.ai Ranks xAI's Grok Build 0.1 Above Grok 4.3 in Agent Arena


Arena.ai's new Agent Arena leaderboard places xAI's Grok Build 0.1 at #15 and Grok 4.3 (High) at #17. Grok Build 0.1 demonstrates improved bash capability and looks to be successfully completing tasks more often overall than Grok 4.3, though it is slightly less steerable and more prone to tool hallucinations.

Arena.ai's Agent Arena leaderboard now ranks xAI's Grok Build 0.1 at #15 and Grok 4.3 (High) at #17. Grok Build 0.1 improves on Grok 4.3's bash capability and appears to complete tasks successfully more often overall. However, it is slightly less steerable and more prone to tool hallucinations, a term that typically refers to errors in tool use, such as calling a tool that does not exist or inventing a tool's output, rather than factual mistakes in prose.
Grok Build 0.1 Overall Rank: #15
Grok 4.3 (High) Overall Rank: #17
Grok Build 0.1 Net Improvement: -5.3%
Grok 4.3 (High) Net Improvement: -9.4%
Grok Build 0.1 Bash Recovery Rank: #9 (+6.1%)
Grok 4.3 (High) Bash Recovery Rank: #16 (-3.8%)
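
The reported figures can be compared directly. The short Python sketch below computes the gap between the two models in percentage points. It assumes the parenthesized percentages next to the Bash Recovery ranks are per-signal net improvement figures, which the source does not state outright; the dictionary keys and the `gap` helper are illustrative names, not part of any Arena.ai API.

```python
# Figures as reported in the article. Key names are illustrative only.
# Assumption: the parenthesized Bash Recovery percentages are net improvement
# values for that signal, on the same scale as the overall net improvement.
reported = {
    "Grok Build 0.1": {"overall_rank": 15, "net_improvement": -5.3,
                       "bash_rank": 9, "bash_recovery": 6.1},
    "Grok 4.3 (High)": {"overall_rank": 17, "net_improvement": -9.4,
                        "bash_rank": 16, "bash_recovery": -3.8},
}


def gap(metric: str) -> float:
    """Grok Build 0.1 minus Grok 4.3 (High), in percentage points."""
    build = reported["Grok Build 0.1"][metric]
    prior = reported["Grok 4.3 (High)"][metric]
    # Round to one decimal to match the precision of the reported figures.
    return round(build - prior, 1)


overall_gap = gap("net_improvement")
bash_gap = gap("bash_recovery")

print(f"Overall net improvement gap: {overall_gap:+.1f} points")
print(f"Bash Recovery gap: {bash_gap:+.1f} points")
```

On these numbers, both models sit below the average model overall, but Grok Build 0.1 is closer to it, and the gap between the two is widest on Bash Recovery.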

The Agent Arena evaluates models on real-world agentic tasks, in which an AI autonomously plans and uses tools, using a causal tracing methodology. This gives a per-signal view of model performance in complex workflows. The results point to a trade-off in Grok Build 0.1: stronger execution, particularly in bash recovery and task completion, against weaker steerability and more tool hallucinations.

The Agent Arena leaderboard reports scores across five signals: Confirmed Success, Praise vs. Complaint, Steerability, Bash Recovery, and Tool Hallucination. Arena.ai launched Agent Arena to evaluate frontier AI models; the full rankings and methodology are on its website. More details on Grok Build 0.1 are available from xAI.

Agent Arena leaderboard rankings for Grok Build models showing net improvement metrics against the baseline performance.
Arena.ai (@arena) on X:

Grok Build 0.1 ranks #15 and Grok 4.3 (High) #17 in the new Agent Arena leaderboard. Grok Build 0.1 improves meaningfully on bash capability over Grok 4.3. It is slightly less steerable and more prone to tool hallucinations, but looks to be successfully completing tasks more often overall. Agent Arena ranks models on real-world agentic tasks using a causal tracing methodology. A model’s net improvement indicates how much better or worse it is than the average model. The thread breaks down how each model from @xAI scored across 5 signals, drawn from real tasks submitted by a global community of users.


Still wondering? A few quick answers below.

The Agent Arena leaderboard dynamically ranks AI models based on their performance in real-world agentic tasks. It evaluates how well models orchestrate tools, measuring signals like task completion, tool reliability, and steerability using a causal tracing methodology.

Grok Build 0.1 ranked #15 overall, while Grok 4.3 (High) ranked #17. Grok Build 0.1 showed meaningful improvement in bash capability and completed tasks more often. However, it was slightly less steerable and more prone to tool hallucinations than Grok 4.3 (High).

The Agent Arena evaluates models across five key signals: Confirmed Success, Praise vs. Complaint, Steerability, Bash Recovery, and Tool Hallucination. These signals provide a comprehensive assessment of an AI agent's ability to perform complex tasks.

