HeadsUpAI

Artificial Analysis Benchmarks AI Agents on Kubernetes Tasks Where Frontier Models Fail

Artificial Analysis, an independent AI benchmarking firm, partnered with IBM Research to launch ITBench-AA, the first in a series of evaluations for agentic enterprise IT tasks. Using the Stirrup agent harness, they found that every frontier model tested scores below 50% when identifying root-cause entities from Kubernetes incident snapshots.

Top score: 46.7% (Claude Opus 4.7)
Second score: 45.8% (GPT-5.5 xhigh)
Lowest cost per task: $0.14 (Gemma 4 31B Reasoning)
Highest cost per task: $5.38 (Claude Opus 4.7)
Task count: 59 Kubernetes incidents

The benchmark reveals a performance gap in autonomous IT operations and a disconnect between verbosity and accuracy: models that take more turns often underperform more concise ones. That pattern is consistent with Fireworks AI's execution tax analysis, which found that reliability in multi-step loops matters more than raw intelligence for keeping agents cost-efficient.

Claude Opus 4.7 (Max Effort) leads the leaderboard at 46.7%, followed by GPT-5.5 (xhigh) at 45.8%. Frontier models lead on score, but smaller models such as Gemma 4 31B (Reasoning) are far cheaper to run: $0.14 per task versus $5.38 for Claude Opus 4.7. The leaderboard and dataset are publicly available, so you can test your own agents on these Site Reliability Engineering tasks.

Artificial Analysis (@ArtificialAnlys) on X:

Artificial Analysis and IBM Research are launching ITBench-AA, the first in a new series of benchmarks evaluating models on agentic enterprise IT tasks, starting with Site Reliability Engineering tasks where frontier models score below 50% ITBench-AA’s SRE tasks benchmark model https://t.co/qlCJ3nM0hK

Still wondering? A few quick answers below.

ITBench-AA is an evaluation framework created by Artificial Analysis and IBM Research to measure how AI agents handle complex IT automation. It specifically tests Site Reliability Engineering tasks by placing agents in a sandboxed Kubernetes environment. Agents must autonomously investigate system snapshots, including logs and metrics, to identify the specific root cause of a failure.

The benchmark uses a metric called average precision at full recall to evaluate model performance. Agents are given shell access to a Kubernetes snapshot and must submit a structured JSON diagnosis. Models are penalized for identifying contributing entities that are not the true root cause, meaning that being overly verbose or identifying symptoms instead of causes lowers the score.
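The scoring rule described above can be sketched in a few lines. This is only one plausible reading of "average precision at full recall", not the official ITBench-AA scorer: it assumes a task earns the precision of the submitted entity set when every true root-cause entity is included, and zero otherwise, with the final score being the mean over tasks. The JSON field name `root_cause_entities` and the entity names are hypothetical.

```python
import json
from typing import Iterable, List, Tuple


def precision_at_full_recall(predicted: Iterable[str], truth: Iterable[str]) -> float:
    """Score one task under the ASSUMED rule: precision of the submitted
    entities, or 0.0 if any true root-cause entity is missing."""
    pred, gold = set(predicted), set(truth)
    if not pred or not gold <= pred:
        return 0.0  # full recall not reached, so the task scores zero
    return len(pred & gold) / len(pred)


def average_score(tasks: List[Tuple[Iterable[str], Iterable[str]]]) -> float:
    """Mean per-task score over (predicted, truth) pairs."""
    scores = [precision_at_full_recall(p, t) for p, t in tasks]
    return sum(scores) / len(scores) if scores else 0.0


# Hypothetical agent submission; the field name is an assumption.
diagnosis = json.loads(
    '{"root_cause_entities": ["deploy/checkout", "svc/cart", "pod/frontend"]}'
)

tasks = [
    (["deploy/checkout"], ["deploy/checkout"]),                 # exact answer
    (diagnosis["root_cause_entities"], ["deploy/checkout"]),    # right, but verbose
    (["svc/cart"], ["deploy/checkout"]),                        # symptom only
]
print(round(average_score(tasks), 3))
```

Under this reading, the verbose submission keeps only a third of the credit even though it contains the right entity, which is the penalty on over-reporting that the paragraph describes.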

Frontier models currently struggle with these tasks, with all tested models scoring below 50%. Claude Opus 4.7 with Adaptive Reasoning and Max Effort leads the leaderboard at 46.7%. GPT-5.5 (xhigh) follows closely at 45.8%, and Qwen3.7 Max holds third place at 42.5% on the Site Reliability Engineering tasks.

The benchmark is designed for transparency and community use. The evaluation is powered by Stirrup, an open-source agent harness available on GitHub. The ITBench-AA dataset is hosted on Hugging Face, and the original ITBench research paper and framework from IBM Research are available on arXiv and GitHub for researchers to review.

Gemma 4 31B (Reasoning) is the most cost-effective at $0.14 per task, while Gemini 3.1 Pro Preview costs $2.23 per task for a lower score. Claude Opus 4.7 is the most expensive at $5.38 per task, reflecting the high compute required for its leading performance.
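To put the per-task prices in context, the short calculation below estimates what one full pass over the benchmark would cost. It is a rough sketch that assumes the reported figure is an average per task and that a run covers all 59 incidents exactly once; actual spend may differ.

```python
# Per-task costs in USD, as reported in the article.
COST_PER_TASK = {
    "Gemma 4 31B (Reasoning)": 0.14,
    "Gemini 3.1 Pro Preview": 2.23,
    "Claude Opus 4.7": 5.38,
}
TASK_COUNT = 59  # Kubernetes incidents in the benchmark


def full_run_cost(per_task: float, n_tasks: int = TASK_COUNT) -> float:
    """Estimated cost of one pass, ASSUMING per_task is a per-task average."""
    return round(per_task * n_tasks, 2)


totals = {model: full_run_cost(cost) for model, cost in COST_PER_TASK.items()}
for model, total in totals.items():
    print(f"{model}: ${total:.2f} per full run")

# How many times more the most expensive model costs than the cheapest.
ratio = COST_PER_TASK["Claude Opus 4.7"] / COST_PER_TASK["Gemma 4 31B (Reasoning)"]
print(f"Claude Opus 4.7 costs about {ratio:.1f}x as much per task as Gemma 4 31B")
```

On these assumptions a full run comes to roughly $8.26 for Gemma 4 31B, $131.57 for Gemini 3.1 Pro Preview, and $317.42 for Claude Opus 4.7, a spread of about 38x between the cheapest and most expensive model.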
