The SkillsBench paper shows that curated agent skills increase average pass rates by 16.2 percentage points on a benchmark of 86 tasks in 11 domains. It evaluates 7 agent-model configurations over 7,308 trajectories under no-skills, curated-skills, and self-generated-skills conditions. Gains vary sharply by domain, from +51.9pp in healthcare to only +4.5pp in software engineering. Self-generated skills provide no average benefit. Focused skills with 2-3 modules outperform comprehensive ones, and skills let smaller models match larger models that lack them. The benchmark emphasises the value of high-quality, human-curated procedural knowledge for AI agents.
SkillsBench Measures Whether Agent Skills Actually Improve AI Performance
SkillsBench introduces a benchmark of 86 tasks across 11 domains to measure the impact of agent skills: structured packages of procedural instructions given to LLM agents at inference time. The study tested 7 agent-model configurations over 7,308 trajectories under three conditions: no skills, curated skills, and self-generated skills.
Curated, human-authored skills raise average pass rates by 16.2 percentage points, but the effect varies enormously by domain: healthcare gains +51.9pp while software engineering gains only +4.5pp. Sixteen of 84 tasks show negative deltas (the pass rate was lower with curated skills than without), and self-generated skills provide no average benefit, suggesting models can't reliably author the procedural knowledge they benefit from consuming.
Two practical insights stand out. First, focused skills with 2-3 modules outperform comprehensive documentation. Second, smaller models equipped with skills can match larger models without them, which makes high-quality curated skills a cost-effective alternative to model upgrades.
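For readers unfamiliar with percentage-point deltas, here is a minimal sketch of how such a figure could be computed from raw pass/fail trajectories. It is not the paper's evaluation code. The record layout, the condition labels (`no_skills`, `curated`), the function names and the toy data are all assumptions made up for illustration, and the numbers are not SkillsBench results.

```python
from collections import defaultdict


def pass_rate(outcomes):
    """Share of passing trajectories, expressed as a percentage (0-100)."""
    return 100.0 * sum(outcomes) / len(outcomes)


def delta_pp_by_domain(records):
    """Compute curated-minus-baseline pass-rate deltas per domain.

    records: iterable of (domain, condition, passed) tuples, where
    condition is "no_skills" or "curated" (hypothetical labels).
    Returns {domain: delta in percentage points}.
    """
    grouped = defaultdict(lambda: defaultdict(list))
    for domain, condition, passed in records:
        grouped[domain][condition].append(bool(passed))
    return {
        domain: pass_rate(conds["curated"]) - pass_rate(conds["no_skills"])
        for domain, conds in grouped.items()
        # Only report domains that have trajectories under both conditions.
        if conds.get("curated") and conds.get("no_skills")
    }


# Invented toy data: four trajectories per domain and condition.
toy_records = (
    [("domain_a", "no_skills", p) for p in (1, 0, 0, 0)]
    + [("domain_a", "curated", p) for p in (1, 1, 1, 0)]
    + [("domain_b", "no_skills", p) for p in (1, 1, 0, 0)]
    + [("domain_b", "curated", p) for p in (1, 0, 0, 0)]
)

deltas = delta_pp_by_domain(toy_records)
for domain, delta in sorted(deltas.items()):
    print(f"{domain}: {delta:+.1f}pp")
```

In this toy data `domain_a` improves with curated skills while `domain_b` gets worse, which is the kind of negative delta the study reports for some tasks. A delta in percentage points is a plain difference between two rates, not a relative change.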
Kol Tregaskes
@koltregaskes

