Fireworks AI Benchmark Reveals Hidden Execution Tax Draining Agent Budgets

Fireworks AI

May 20, 2026 · Updated Jun 12, 2026

Fireworks AI released a benchmark report on browser agents showing that malformed outputs create a hidden execution tax that inflates production costs. The study found that reliability in multi-step loops matters more than raw intelligence, with one frontier model, Gemini 2.5 Flash, incurring a 22.9% execution tax from retries.

Fireworks AI, an inference platform for fast model serving, released a benchmark report describing a hidden Agent Execution Tax. Across 720 browser tasks, the study found that malformed JSON outputs force silent retries that inflate costs. The tax is the ratio of wasted inference to productive inference.

Gemini 2.5 Flash execution tax: 22.9%
MiniMax M2.5 cost efficiency: 2.3x cheaper per successful task than Gemini 2.5 Flash
Kimi K2.5 retry rate: 0% across 852 calls
Benchmark scope: 720 tasks across 15 websites
Primary metric: Reliability-Adjusted Accuracy

This reliability gap shifts the focus from raw intelligence to sustained execution in multi-step loops. Models may ace static benchmarks yet falter in production, where malformed responses cascade into retries and higher costs. Together with Fireworks AI's Day-0 Kimi K2.6 support, the results suggest that the serving layer is critical for agentic stability.

Evaluate models on reliability-adjusted accuracy rather than token pricing alone. For high-volume workloads, MiniMax M2.5 proved 2.3x cheaper than Gemini 2.5 Flash per successful task. Developers can access these models, along with GLM 5.1 training, via the Fireworks serverless API to minimize execution overhead.

Fireworks AI (@FireworksAI_HQ) · May 20

We ran 720 browser agent tasks with @nottecore across frontier models. One baseline model produced malformed outputs in ~1 out of every 5 calls, leading to retries inside multi-step workflows. Across Kimi K2.5, GLM-5, and MiniMax M2.5 served on Fireworks, retry rates were near zero and latency stayed stable even as tasks extended across multiple steps. Same workload. Same agent loop. Different execution behavior. That gap is what shows up as cost, latency, and reliability divergence in production agent systems. Read the report: https://t.co/6thZVvLomR

Still wondering? A few quick answers below.

What is the Agent Execution Tax?

The Agent Execution Tax is a metric that measures the ratio of wasted inference to productive inference in autonomous agent systems. It quantifies the overhead created when a model produces malformed structured outputs, such as invalid JSON, which forces the system to perform silent retries. This tax directly increases the latency and cost of running multi-step agent loops.
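As a rough illustration of that ratio, the sketch below computes the tax from call counts and converts a retry rate into a tax. It assumes that "inference" is counted in calls and that a retry rate is the share of all calls that were wasted; the report may count tokens instead, and the function names are hypothetical.

```python
def execution_tax(wasted_calls: int, productive_calls: int) -> float:
    """Ratio of wasted (retried) calls to productive calls."""
    if productive_calls <= 0:
        raise ValueError("productive_calls must be positive")
    return wasted_calls / productive_calls


def tax_from_retry_rate(retry_rate: float) -> float:
    """Convert a retry rate (wasted share of ALL calls) into an execution tax."""
    if not 0 <= retry_rate < 1:
        raise ValueError("retry_rate must be in [0, 1)")
    return retry_rate / (1 - retry_rate)


# Assumed reading: 186 of every 1,000 calls are retries, 814 are productive.
tax = tax_from_retry_rate(0.186)
print(f"execution tax: {tax:.1%}")
```

Under that reading, the 18.6 percent retry rate cited in this article works out to a tax of roughly 22.9%, which is close to the figure quoted for Gemini 2.5 Flash. Whether the report derives one number from the other this way is an assumption.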

How does the Agent Execution Tax affect production costs?

High execution taxes lead to significant financial waste because every retry requires re-sending the entire conversation history as input tokens. For a model with an 18.6 percent retry rate, nearly one in five tokens produces zero value. At a volume of 10,000 tasks per day, this overhead can cost an organization over $40,000 annually in wasted inference.
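The arithmetic behind an annual figure like this can be sketched as follows. The per-task cost of $0.05 is a hypothetical placeholder, not a number from the report, and the function name is illustrative; only the 10,000 tasks per day and the 22.9% tax come from the article.

```python
def annual_wasted_cost(
    tasks_per_day: int,
    productive_cost_per_task: float,
    execution_tax: float,
    days_per_year: int = 365,
) -> float:
    """Yearly spend on retries, given the tax as wasted / productive inference."""
    productive_spend = tasks_per_day * days_per_year * productive_cost_per_task
    return productive_spend * execution_tax


# Hypothetical: $0.05 of productive inference per task (placeholder value).
wasted = annual_wasted_cost(10_000, 0.05, 0.229)
print(f"wasted per year: ${wasted:,.0f}")
```

With these placeholder inputs the waste lands in the low $40,000s per year. The result scales linearly with both the per-task cost and the tax, so a model with a tax under 2 percent would waste less than a tenth as much on the same workload.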

Which AI models performed best in the Fireworks AI benchmark?

MiniMax M2.5, GLM-5, and Kimi K2.5, all served on Fireworks AI, maintained execution taxes below 2 percent. MiniMax M2.5 was identified as the best value, costing 2.3 times less per successful task than Gemini 2.5 Flash. Kimi K2.5 provided the fastest real-time response with zero parse retries, while GLM-5 achieved the highest overall task accuracy on complex reasoning sites.

What is Reliability-Adjusted Accuracy for AI agents?

Reliability-Adjusted Accuracy is a compound metric that discounts a model's raw task success rate by its execution overhead. It is calculated by multiplying the task success rate by one minus the execution tax. This provides a more realistic view of production performance than standard benchmarks, as it accounts for the hidden costs and failures associated with malformed structured outputs.
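The formula described above is short enough to write out directly. In this sketch the 80% success rate is a hypothetical value chosen for illustration, while the 22.9% tax is the figure quoted for Gemini 2.5 Flash; the function name is also an assumption.

```python
def reliability_adjusted_accuracy(success_rate: float, execution_tax: float) -> float:
    """Task success rate multiplied by one minus the execution tax."""
    if not 0 <= success_rate <= 1:
        raise ValueError("success_rate must be in [0, 1]")
    if not 0 <= execution_tax <= 1:
        raise ValueError("execution_tax must be in [0, 1]")
    return success_rate * (1 - execution_tax)


# Hypothetical 80% raw success rate under two different taxes.
high_tax = reliability_adjusted_accuracy(0.80, 0.229)
low_tax = reliability_adjusted_accuracy(0.80, 0.02)
print(f"high tax: {high_tax:.3f}, low tax: {low_tax:.3f}")
```

Two models with the same raw success rate end up far apart once the tax is applied, which is the point of ranking by the adjusted figure rather than by raw accuracy.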

Why do models fail more often in agent loops than in static benchmarks?

Static benchmarks measure isolated intelligence, but agent loops require sustained execution across multiple sequential steps. In these loops, a model must consistently output valid structured actions based on page observations. If a model fails to follow the required schema even once, it triggers retries that can desync the agent's internal state and cause the entire multi-step task to fail.
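A minimal sketch of one step in such a loop shows where the wasted calls come from. The action schema (`action` and `target` keys), the function names, and the stubbed model are all hypothetical; this is not the harness used in the benchmark.

```python
import json
from typing import Callable

REQUIRED_KEYS = {"action", "target"}  # hypothetical action schema


def parse_action(raw: str) -> dict:
    """Parse a model response and check it against the assumed schema."""
    action = json.loads(raw)  # raises json.JSONDecodeError on invalid JSON
    if not isinstance(action, dict) or not REQUIRED_KEYS <= action.keys():
        raise ValueError("response does not match the action schema")
    return action


def next_action(
    call_model: Callable[[list[str]], str],
    history: list[str],
    max_retries: int = 2,
) -> tuple[dict, int]:
    """Return (action, wasted_calls); every retry re-sends the full history."""
    wasted = 0
    for _ in range(max_retries + 1):
        raw = call_model(history)
        try:
            return parse_action(raw), wasted
        except ValueError:  # json.JSONDecodeError is a subclass of ValueError
            wasted += 1
    raise RuntimeError(f"no valid action after {wasted} attempts")


# Stub model: the first response is truncated JSON, the second is valid.
responses = iter(['{"action": "click"', '{"action": "click", "target": "#submit"}'])
action, wasted = next_action(lambda history: next(responses), ["page observation"])
print(action, wasted)
```

Here the step succeeds, but only after one call whose output was thrown away. Summing `wasted` across every step of every task, and dividing by the productive calls, gives the kind of ratio the execution tax describes.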

Every HeadsUpAI update is written based on its original source and reviewed before it's published.

Keep reading

Fireworks AI Research Shows Hybrid Agents Outperform Monolithic Frontier Models

Fireworks AI demonstrated that a GLM 5.1 worker using Claude Opus 4.7 as a sparse advisor beats standalone Opus on legal benchmarks. This architectural shift achieves higher accuracy on complex tasks while reducing inference costs by over 60%.

Artificial Analysis Benchmarks AI Agents on Kubernetes Tasks Where Frontier Models Fail

Artificial Analysis · May 28

Artificial Analysis and IBM Research launched ITBench-AA, a benchmark evaluating AI agents on autonomous Kubernetes incident diagnosis. The results show that even frontier models struggle with complex IT troubleshooting, with the highest-performing models currently scoring below 50%.

Researchers Reveal Performance Gaps in Agent Skills and Propose Refinement Fix

DAIR.AI · Apr 8

New research finds that AI agent performance gains from domain-specific skills disappear when agents must search through large, noisy collections of 34,000 real-world options. Introducing a query-specific refinement step recovers this lost performance, boosting Claude Opus 4.6 success rates on terminal tasks by nearly 8%.

LangChain Research Makes AI Agent Post-Training Verification 1000x Cheaper

LangChain · Jun 7

LangChain Labs and Harvey published a study demonstrating how to significantly reduce the cost of LLM-as-judge verifiers for AI agents. Their research shows that batching verifier calls and using open-weight models can cut costs by up to 1,000 times. This makes it more practical to run extensive experiments and accelerate the iteration cycle for agent development, especially in complex domains like legal work.
