We ran 720 browser agent tasks with @nottecore across frontier models. One baseline model produced malformed outputs in ~1 out of every 5 calls, leading to retries inside multi-step workflows. Across Kimi K2.5, GLM-5, and MiniMax M2.5 served on Fireworks, retry rates were near zero and latency stayed stable even as tasks extended across multiple steps. Same workload. Same agent loop. Different execution behavior. That gap is what shows up as cost, latency, and reliability divergence in production agent systems. Read the report: https://t.co/6thZVvLomR
Fireworks AI Benchmark Reveals Hidden Execution Tax Draining Agent Budgets
Fireworks AI · Updated
Fireworks AI released a benchmark report on browser agents showing that malformed outputs create a hidden "execution tax": retries that inflate production costs. The study found that reliability in multi-step loops matters more than raw intelligence, with Gemini 2.5 Flash spending 22.9% of its inference budget on retries.
- Gemini 2.5 Flash execution tax: 22.9%
- MiniMax M2.5 cost efficiency: 2.3x cheaper per successful task than Gemini
- Kimi K2.5 retry rate: 0% across 852 calls
- Benchmark scope: 720 tasks across 15 websites
- Primary metric: Reliability-Adjusted Accuracy
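As a rough illustration of how an execution-tax figure like the ones above could be computed, the sketch below treats it as the share of model calls that were retries. That definition is an assumption (the report may weight by tokens or dollars instead), and the function name `execution_tax` is hypothetical. Only the Kimi K2.5 input (0 retries across 852 calls) comes from the summary; the second input is a made-up placeholder.

```python
def execution_tax(total_calls: int, retry_calls: int) -> float:
    """Share of calls that were retries.

    Assumption: every call costs roughly the same, so the share of retry
    calls approximates the share of inference budget spent on retries.
    """
    if total_calls <= 0:
        raise ValueError("total_calls must be positive")
    if not 0 <= retry_calls <= total_calls:
        raise ValueError("retry_calls must be between 0 and total_calls")
    return retry_calls / total_calls


# From the summary: Kimi K2.5 had a 0% retry rate across 852 calls.
kimi_tax = execution_tax(total_calls=852, retry_calls=0)

# Placeholder numbers for illustration only, not taken from the report.
placeholder_tax = execution_tax(total_calls=1000, retry_calls=229)

print(f"Kimi K2.5: {kimi_tax:.1%}")
print(f"Placeholder model: {placeholder_tax:.1%}")
```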
This reliability gap shifts focus from raw intelligence to sustained execution in multi-step loops. While models may ace static benchmarks, they often fail in production when malformed responses cascade into higher costs. Alongside Fireworks AI's Day-0 Kimi K2.6 support, this highlights that the serving layer is critical for agentic stability.
Evaluate models on reliability-adjusted accuracy and cost per successful task rather than on token pricing alone. For high-volume workloads, MiniMax M2.5 proved 2.3x cheaper per successful task than Gemini 2.5 Flash. Developers can access these models, including GLM 5.1 training, via the Fireworks serverless API to minimize execution overhead.
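To make the per-outcome comparison concrete, here is a minimal sketch of cost per successful task. The `RunStats` type, its field names, and all numbers are hypothetical placeholders rather than figures from the report, and the report's own formula may differ. The point it illustrates: two models with the same price per call can differ sharply in cost once retries and failed tasks are counted.

```python
from dataclasses import dataclass


@dataclass
class RunStats:
    total_calls: int       # every model call, retries included
    cost_per_call: float   # average dollars per call (placeholder value)
    successful_tasks: int  # tasks that finished correctly


def cost_per_successful_task(stats: RunStats) -> float:
    """Total spend, retries included, divided by the number of successful tasks."""
    if stats.successful_tasks <= 0:
        raise ValueError("successful_tasks must be positive")
    return stats.total_calls * stats.cost_per_call / stats.successful_tasks


# Placeholder inputs: same price per call, different retry and success counts.
model_a = RunStats(total_calls=600, cost_per_call=0.25, successful_tasks=60)
model_b = RunStats(total_calls=800, cost_per_call=0.25, successful_tasks=40)

cost_a = cost_per_successful_task(model_a)
cost_b = cost_per_successful_task(model_b)
ratio = cost_b / cost_a

print(f"Model A: ${cost_a:.2f} per successful task")
print(f"Model B: ${cost_b:.2f} per successful task")
print(f"Model B costs {ratio:.1f}x as much per successful task")
```

With these placeholder inputs, model B costs twice as much per successful task despite identical per-call pricing, because it makes more calls and completes fewer tasks.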





