SkillsBench Measures Whether Agent Skills Actually Improve AI Performance

Kol Tregaskes

Mar 2, 2026 · Updated Jun 6, 2026

SkillsBench launched as a benchmark of 86 tasks across 11 domains, testing whether agent skills actually improve AI agent performance. Curated human-authored skills raise pass rates by 16.2 percentage points on average, while self-generated skills provide no benefit.

SkillsBench is a benchmark of 86 tasks across 11 domains that measures the impact of agent skills: structured procedural instruction packages given to LLM agents at inference time. The study tested 7 agent-model configurations over 7,308 trajectories under three conditions: no skills, curated skills, and self-generated skills.

Curated human-authored skills raise average pass rates by 16.2 percentage points (pp), but the effect varies enormously by domain: healthcare gains 51.9pp while software engineering gains only 4.5pp. Sixteen of 84 tasks show negative deltas, meaning curated skills lowered the pass rate on those tasks. Self-generated skills provide no average benefit, suggesting models can't reliably author the procedural knowledge they benefit from consuming.
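To make the percentage-point arithmetic concrete, here is a minimal Python sketch of how a per-task skill delta could be computed from pass/fail trajectories. The records, task names, and function names are hypothetical and purely illustrative; they are not taken from the paper, and the paper's own aggregation method may differ.

```python
from collections import defaultdict

# Hypothetical trajectory records: (task, condition, passed).
# Invented for illustration; not data from SkillsBench.
TRAJECTORIES = [
    ("task_a", "no_skills", False),
    ("task_a", "no_skills", True),
    ("task_a", "curated", True),
    ("task_a", "curated", True),
    ("task_b", "no_skills", True),
    ("task_b", "no_skills", True),
    ("task_b", "curated", True),
    ("task_b", "curated", False),
]


def pass_rates(trajectories):
    """Return {task: {condition: pass rate in percent}}."""
    counts = defaultdict(lambda: defaultdict(lambda: [0, 0]))  # [passes, runs]
    for task, condition, passed in trajectories:
        cell = counts[task][condition]
        cell[0] += int(passed)
        cell[1] += 1
    return {
        task: {cond: 100.0 * p / n for cond, (p, n) in conds.items()}
        for task, conds in counts.items()
    }


def skill_deltas(rates, baseline="no_skills", treatment="curated"):
    """Percentage-point delta per task: treatment rate minus baseline rate."""
    return {task: r[treatment] - r[baseline] for task, r in rates.items()}


rates = pass_rates(TRAJECTORIES)
deltas = skill_deltas(rates)
mean_delta = sum(deltas.values()) / len(deltas)
# Tasks where the skill condition scored below the no-skills baseline.
negative_tasks = sorted(t for t, d in deltas.items() if d < 0)

print(deltas, mean_delta, negative_tasks)
```

In this toy data one task improves by 50pp and the other drops by 50pp, so the average delta is zero even though one task regressed. That is why a count of negative-delta tasks is worth reading alongside the average.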

Two practical insights stand out. First, focused skills with 2-3 modules outperform comprehensive documentation. Second, smaller models equipped with skills can match larger models without them, which makes high-quality curated skills a cost-effective alternative to model upgrades.

View the full update on arxiv.org

Kol Tregaskes

@koltregaskes · Mar 2

SkillsBench paper shows curated agent skills increase average pass rates by 16.2 percentage points across 84 tasks in 11 domains.
- Evaluated on 7 agent-model configurations with 7,308 trajectories under no-skills, curated-skills, and self-generated conditions.
- Dramatic variation by domain with healthcare up +51.9pp and software engineering only +4.5pp.
- Self-generated skills provide no average benefit.
- Focused skills with 2-3 modules outperform comprehensive ones and enable smaller models to match larger ones.

The benchmark emphasises the value of high-quality human-curated procedural knowledge for AI agents.


