OpenRouter Agent Battle Royale Reveals Alignment Tax Impacts Performance

jacky

Jun 13, 2026

OpenRouter ran 11 LLMs through 30 battle royale games to test agent performance in zero-sum settings. Grok 4.1 Fast won 13 games at $0.97 per win, while Claude Sonnet 4.6 struggled because it prioritized cooperation. The results suggest that alignment training, designed for safety, can act as a performance tax in competitive tasks that reward ruthlessness.

View the full update on openrouter.ai

jacky

@jjacky · 3d ago

no benchmark will tell you this: LLMs can be /too/ nice
unsurprisingly, in a competitive zero-sum setting, being nice can be bad
i built royale: last agent standing, a br for agents, and ran it 30 times
the nicest model lost hard. the model you least expected, won
🧵: https://t.co/lEFpfqnIdJ

Keep reading

OpenRouter Adds Grok 4.3 With Massive Agentic Performance Jump and Lower Pricing

OpenRouter integrated xAI's new Grok-4.3 reasoning model, which features a 1 million token context window and a significant boost in autonomous task performance. The model achieved an Elo score of 1500 on the GDPval-AA benchmark for economically valuable tasks, surpassing previous flagship models while launching at a lower price than its predecessor.

Arena.ai Ranks xAI's Grok Build 0.1 Above Grok 4.3 in Agent Arena

Arena · 5d ago

Arena.ai's new Agent Arena leaderboard places xAI's Grok Build 0.1 at #15 and Grok 4.3 (High) at #17. Grok Build 0.1 shows improved bash capability and appears to complete tasks successfully more often than Grok 4.3, though it is slightly less steerable and more prone to tool hallucinations.

Anthropic · Apr 25

Anthropic Proves Smarter AI Agents Win Better Deals in Marketplaces

Anthropic's Project Deal experiment allowed autonomous Claude agents to negotiate and trade real physical goods on behalf of employees. The study found that while agent-to-agent commerce is viable, users with more capable models secured significantly better financial outcomes without realizing they were at a disadvantage.

Fireworks AI Benchmark Reveals Hidden Execution Tax Draining Agent Budgets

Fireworks AI · May 20

Fireworks AI released a benchmark report on browser agents revealing that malformed outputs create a hidden execution tax that inflates production costs. The study found that reliability in multi-step loops matters more than raw intelligence, with some frontier models wasting nearly a quarter of their inference budget on retries.