Artificial Analysis Benchmarks AI Agents on Kubernetes Tasks Where Frontier Models Fail

Artificial Analysis

May 28, 2026 · Updated Jun 5, 2026

Artificial Analysis and IBM Research launched ITBench-AA, a benchmark evaluating AI agents on autonomous Kubernetes incident diagnosis. The results show that even frontier models struggle with complex IT troubleshooting, with the highest-performing models currently scoring below 50%.

Artificial Analysis, an independent AI benchmarking firm, partnered with IBM Research to launch ITBench-AA, the first in a series of benchmarks for agentic enterprise IT tasks. Using the Stirrup agent harness, they found that every frontier model tested scores below 50% when identifying root-cause entities from Kubernetes incident snapshots.

Top score (Claude Opus 4.7): 46.7%
Second-highest score (GPT-5.5 xhigh): 45.8%
Lowest cost per task: $0.14 (Gemma 4 31B Reasoning)
Highest cost per task: $5.38 (Claude Opus 4.7)
Task count: 59 Kubernetes incidents

The benchmark reveals a performance gap in autonomous IT operations and a disconnect between verbosity and accuracy: models that take more turns often underperform more concise ones. This is consistent with Fireworks AI's execution tax analysis, which found that reliability in multi-step loops matters more than raw intelligence for keeping agentic workloads cost-efficient.

Claude Opus 4.7 (Max Effort) leads the leaderboard at 46.7%, followed by GPT-5.5 (xhigh). While frontier models lead on accuracy, smaller models like Gemma 4 31B (Reasoning) offer better cost efficiency. You can access the leaderboard and dataset to test your own agents on these Site Reliability Engineering tasks.

View the full update on artificialanalysis.ai

Artificial Analysis

@ArtificialAnlys · May 27

Artificial Analysis and IBM Research are launching ITBench-AA, the first in a new series of benchmarks evaluating models on agentic enterprise IT tasks, starting with Site Reliability Engineering tasks where frontier models score below 50%. ITBench-AA’s SRE tasks benchmark model… https://t.co/qlCJ3nM0hK


Still wondering? A few quick answers below.

What is the ITBench-AA benchmark?

ITBench-AA is an evaluation framework created by Artificial Analysis and IBM Research to measure how AI agents handle complex IT automation. It specifically tests Site Reliability Engineering tasks by placing agents in a sandboxed Kubernetes environment. Agents must autonomously investigate system snapshots, including logs and metrics, to identify the specific root cause of a failure.

How does ITBench-AA score AI models?

The benchmark uses a metric called average precision at full recall to evaluate model performance. Agents are given shell access to a Kubernetes snapshot and must submit a structured JSON diagnosis. Models are penalized for identifying contributing entities that are not the true root cause, meaning that being overly verbose or identifying symptoms instead of causes lowers the score.
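The penalty for over-reporting can be illustrated with a minimal Python sketch. It rests on assumptions the article does not confirm: that a diagnosis earns credit only if it names every true root cause, that its score is then the share of named entities that are true root causes, and that the benchmark score is the mean over incidents. The `root_cause_entities` field, the function names, and the entity names are all hypothetical; the official ITBench-AA schema and scoring code may differ.

```python
import json
from statistics import mean
from typing import Iterable, Set


def precision_at_full_recall(predicted: Iterable[str], truth: Set[str]) -> float:
    """Score one diagnosis under the assumed reading described above.

    Returns 0.0 unless every true root cause was named; otherwise returns
    the fraction of named entities that are true root causes.
    """
    if not truth:
        raise ValueError("each incident needs at least one root-cause entity")
    named = set(predicted)  # duplicates in a submission count once
    if not truth <= named:  # a true root cause is missing: no credit
        return 0.0
    return len(truth) / len(named)


def score_submission(raw_json: str, truth: Set[str]) -> float:
    """Parse a diagnosis that uses the hypothetical 'root_cause_entities' field."""
    diagnosis = json.loads(raw_json)
    return precision_at_full_recall(diagnosis["root_cause_entities"], truth)


# Hypothetical incident in which a single deployment is the true root cause.
truth = {"deployment/checkout"}
concise = '{"root_cause_entities": ["deployment/checkout"]}'
verbose = ('{"root_cause_entities": '
           '["deployment/checkout", "service/cart", "pod/frontend-0"]}')
wrong = '{"root_cause_entities": ["service/cart"]}'

scores = [score_submission(s, truth) for s in (concise, verbose, wrong)]
print(scores)        # [1.0, 0.3333333333333333, 0.0]
print(mean(scores))  # benchmark-style average across the three diagnoses
```

Under these assumptions the verbose diagnosis finds the right entity but keeps only a third of the credit, which matches the article's point that listing symptoms alongside the cause lowers the score.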

Which AI models perform best on ITBench-AA?

Frontier models currently struggle with these tasks, with all tested models scoring below 50%. Claude Opus 4.7 with Adaptive Reasoning and Max Effort leads the leaderboard with a score of 46.7%. GPT-5.5 (xhigh) follows closely at 45.8%, while Qwen3.7 Max holds third place at 42.5% on the Site Reliability Engineering tasks.

Are the ITBench-AA dataset and tooling open source?

Yes, the benchmark is designed for transparency and community use. The evaluation is powered by Stirrup, an open-source agent harness available on GitHub. Additionally, the ITBench-AA dataset is hosted on Hugging Face, and the original ITBench research paper and framework from IBM Research are available on arXiv and GitHub for researchers to review.

What is the cost of running ITBench-AA tasks?

Gemma 4 31B (Reasoning) is the most cost-effective at $0.14 per task, while Gemini 3.1 Pro Preview costs $2.23 per task for a lower score. Claude Opus 4.7 is the most expensive at $5.38 per task, reflecting the high compute required for its leading performance.
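To put the per-task prices in context, the short Python sketch below scales them to the full 59-incident benchmark and compares the two ends of the range. It assumes each quoted figure is an average over all 59 tasks, which the article does not state explicitly. The constant and function names are illustrative.

```python
TASK_COUNT = 59  # Kubernetes incidents in the benchmark, per the article

# Cost per task in USD, as quoted in the article.
COST_PER_TASK_USD = {
    "Gemma 4 31B (Reasoning)": 0.14,
    "Gemini 3.1 Pro Preview": 2.23,
    "Claude Opus 4.7": 5.38,
}


def full_run_cost(cost_per_task: float, tasks: int = TASK_COUNT) -> float:
    """Estimated cost of one pass over every task, rounded to cents."""
    return round(cost_per_task * tasks, 2)


for model, cost in COST_PER_TASK_USD.items():
    print(f"{model}: ${full_run_cost(cost):.2f} per full run")
# Gemma 4 31B (Reasoning): $8.26 per full run
# Gemini 3.1 Pro Preview: $131.57 per full run
# Claude Opus 4.7: $317.42 per full run

ratio = COST_PER_TASK_USD["Claude Opus 4.7"] / COST_PER_TASK_USD["Gemma 4 31B (Reasoning)"]
print(f"Most expensive vs. cheapest: {ratio:.1f}x")  # 38.4x
```

On these figures, one full run with the leaderboard's top model costs roughly 38 times as much as a run with the cheapest model listed.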


Keep reading

Artificial Analysis Launches Industry Indices to Benchmark AI on Professional Tasks

Artificial Analysis released six new Capability Indices evaluating AI models across Finance, Legal, Healthcare, Strategy, Engineering, and Economics. The benchmarks use occupational data to weight model performance based on the actual frequency of professional tasks like contract review and clinical documentation. Results reveal a massive frontier premium, with top-tier models costing over 100x more than mid-tier alternatives for incremental accuracy gains.

SkillsBench Measures Whether Agent Skills Actually Improve AI Performance

Kol Tregaskes · Mar 2

SkillsBench launched as a benchmark of 86 tasks across 11 domains, testing whether agent skills actually improve AI agent performance. Curated human-authored skills raise pass rates by 16.2 percentage points on average, while self-generated skills provide no benefit.

Fireworks AI Benchmark Reveals Hidden Execution Tax Draining Agent Budgets

Fireworks AI · May 20

Fireworks AI released a benchmark report on browser agents revealing that malformed outputs create a hidden execution tax that inflates production costs. The study found that reliability in multi-step loops matters more than raw intelligence, with some frontier models wasting nearly a quarter of their inference budget on retries.

Cursor Publishes CursorBench, Its Internal Agentic Coding Evaluation Methodology

OpenAI · Mar 15

Cursor published CursorBench, its internal eval suite that scores models on real coding agent tasks from actual developer sessions. Public benchmarks struggle to differentiate frontier models reliably, while CursorBench produces more separation where it matters most.

