HeadsUpAI

Fireworks AI Benchmark Reveals Hidden Execution Tax Draining Agent Budgets

Fireworks AI, an inference platform for fast model serving, released a benchmark report revealing a hidden Agent Execution Tax. Across 720 browser agent tasks, the study found that malformed JSON outputs force silent retries that inflate costs. The tax is the ratio of wasted inference to productive inference.

Gemini 2.5 Flash execution tax: 22.9%
MiniMax M2.5 cost efficiency: 2.3x cheaper per successful task than Gemini
Kimi K2.5 retry rate: 0% across 852 calls
Benchmark scope: 720 tasks across 15 websites
Primary metric: Reliability-Adjusted Accuracy

This reliability gap shifts focus from raw intelligence to sustained execution in multi-step loops. While models may ace static benchmarks, they often fail in production when malformed responses cascade into higher costs. Alongside Fireworks AI's Day-0 Kimi K2.6 support, this highlights that the serving layer is critical for agentic stability.

You should evaluate models on reliability-adjusted accuracy rather than token pricing alone. For high-volume workloads, MiniMax M2.5 proved 2.3x cheaper than Gemini 2.5 Flash per successful task. Developers can access these models, including GLM 5.1 training, via the Fireworks serverless API to minimize execution overhead.

Fireworks AI (@FireworksAI_HQ) on X:

We ran 720 browser agent tasks with @nottecore across frontier models. One baseline model produced malformed outputs in ~1 out of every 5 calls, leading to retries inside multi-step workflows. Across Kimi K2.5, GLM-5, and MiniMax M2.5 served on Fireworks, retry rates were near zero and latency stayed stable even as tasks extended across multiple steps. Same workload. Same agent loop. Different execution behavior. That gap is what shows up as cost, latency, and reliability divergence in production agent systems. Read the report: https://t.co/6thZVvLomR


Still wondering? A few quick answers below.

The Agent Execution Tax is a metric that measures the ratio of wasted inference to productive inference in autonomous agent systems. It quantifies the overhead created when a model produces malformed structured outputs, such as invalid JSON, which forces the system to perform silent retries. This tax directly increases the latency and cost of running multi-step agent loops.
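As a rough illustration of that ratio, the sketch below computes the tax from call counts and converts a retry rate into a tax. It assumes the tax is counted in calls rather than tokens, and that a retry rate is the share of all calls that were retries; the function names are hypothetical and the report may count differently. Under these assumptions, an 18.6 percent retry rate works out to roughly a 22.9 percent tax.

```python
def execution_tax(wasted_calls: int, productive_calls: int) -> float:
    """Ratio of wasted inference to productive inference, counted in calls."""
    if productive_calls <= 0:
        raise ValueError("productive_calls must be positive")
    if wasted_calls < 0:
        raise ValueError("wasted_calls must not be negative")
    return wasted_calls / productive_calls


def tax_from_retry_rate(retry_rate: float) -> float:
    """Convert the share of all calls that were wasted into a wasted/productive ratio.

    Assumption: retry_rate = wasted / (wasted + productive).
    """
    if not 0 <= retry_rate < 1:
        raise ValueError("retry_rate must be in [0, 1)")
    return retry_rate / (1 - retry_rate)


if __name__ == "__main__":
    # Retry rate quoted in the article for the baseline model.
    print(f"{tax_from_retry_rate(0.186):.1%}")
```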

High execution taxes lead to significant financial waste because every retry requires re-sending the entire conversation history as input tokens. For a model with an 18.6 percent retry rate, nearly one in five calls produces no usable output. At a volume of 10,000 tasks per day, this overhead can cost an organization over 40,000 dollars annually in wasted inference.
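The re-send mechanic can be sketched as simple arithmetic: a retry at a given step pays again for the whole context accumulated up to that step, so retries late in a task cost more than early ones. Everything in this sketch is a hypothetical illustration, including the function names, the per-step context sizes, and the price parameter; none of the numbers come from the report.

```python
def wasted_input_tokens(context_tokens_per_step: list[int],
                        retries_per_step: list[int]) -> int:
    """Input tokens re-sent because of retries within one task.

    Each retry at step i re-sends that step's full context, so the waste
    is the sum of context size times retry count over all steps.
    """
    if len(context_tokens_per_step) != len(retries_per_step):
        raise ValueError("both lists must have one entry per step")
    return sum(ctx * n for ctx, n in zip(context_tokens_per_step, retries_per_step))


def annual_wasted_spend(wasted_tokens_per_task: float,
                        price_per_million_input_tokens: float,
                        tasks_per_day: int,
                        days: int = 365) -> float:
    """Yearly cost of the re-sent input tokens, in the price's currency."""
    per_task = wasted_tokens_per_task / 1_000_000 * price_per_million_input_tokens
    return per_task * tasks_per_day * days


if __name__ == "__main__":
    # Hypothetical three-step task whose context grows each step,
    # with one retry at step 2 and two retries at step 3.
    wasted = wasted_input_tokens([1_000, 2_000, 3_000], [0, 1, 2])
    print(wasted, annual_wasted_spend(wasted, 1.0, 10_000))
```

Plugging in measured context sizes and a real price list would show how quickly the overhead scales with task volume.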

MiniMax M2.5, GLM-5, and Kimi K2.5 served on Fireworks AI all maintained execution taxes below 2 percent. MiniMax M2.5 was identified as the best value, costing 2.3 times less per successful task than Gemini 2.5 Flash. Kimi K2.5 provided the fastest real-time response with zero parse retries, while GLM-5 achieved the highest overall task accuracy on complex reasoning sites.

Reliability-Adjusted Accuracy is a compound metric that discounts a model's raw task success rate by its execution overhead. It is calculated by multiplying the task success rate by one minus the execution tax. This provides a more realistic view of production performance than standard benchmarks, as it accounts for the hidden costs and failures associated with malformed structured outputs.
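The calculation described above is a single multiplication, sketched here with a hypothetical function name. The success rate in the usage example is illustrative only and is not a figure from the report.

```python
def reliability_adjusted_accuracy(success_rate: float, execution_tax: float) -> float:
    """Task success rate discounted by execution overhead: success * (1 - tax)."""
    if not 0 <= success_rate <= 1:
        raise ValueError("success_rate must be in [0, 1]")
    if execution_tax < 0:
        raise ValueError("execution_tax must not be negative")
    return success_rate * (1 - execution_tax)


if __name__ == "__main__":
    # Illustrative success rate combined with the 22.9% tax quoted in the article.
    print(reliability_adjusted_accuracy(0.80, 0.229))
```

With a tax near zero the adjusted figure stays close to the raw success rate, which is why low-retry models lose little under this metric.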

Static benchmarks measure isolated intelligence, but agent loops require sustained execution across multiple sequential steps. In these loops, a model must consistently output valid structured actions based on page observations. If a model fails to follow the required schema even once, it triggers retries that can desync the agent's internal state and cause the entire multi-step task to fail.
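A minimal sketch of the retry path in such a loop is shown below. It assumes a hypothetical `call_model` callable that takes the conversation history and returns raw text, and a hypothetical action schema with `action` and `target` keys; the benchmark's actual agent harness is not described in the article and may work differently.

```python
import json
from typing import Callable

# Hypothetical schema: every action must carry these keys.
REQUIRED_KEYS = {"action", "target"}


def next_action(call_model: Callable[[list[str]], str],
                history: list[str],
                max_retries: int = 2) -> tuple[dict, int]:
    """Ask the model for one structured action, retrying on malformed output.

    Returns the parsed action and the number of retries it took. Each retry
    calls the model again with the same history, which is the wasted inference.
    """
    retries = 0
    while True:
        raw = call_model(history)
        try:
            action = json.loads(raw)
        except json.JSONDecodeError:
            action = None
        # Accept only a JSON object that contains every required key.
        if isinstance(action, dict) and REQUIRED_KEYS.issubset(action):
            return action, retries
        if retries >= max_retries:
            raise RuntimeError(f"no valid action after {retries} retries")
        retries += 1


if __name__ == "__main__":
    # Fake model: first reply is truncated JSON, second is valid.
    replies = iter(['{"action": "click"', '{"action": "click", "target": "#buy"}'])
    print(next_action(lambda history: next(replies), ["page observation"]))
```

Counting retries per step, as this function does, gives the raw numbers needed to compute a retry rate for a whole run.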
