Skillgrade Brings Regression Testing to Agent Skills via Automated Evals

Minko Gechev


Skillgrade is an open-source CLI that runs automated evals against agent skills and catches regressions when a skill, model, or agent changes. Until now, there was no standard way to verify that agent skills hold up across model updates.

Skillgrade, a CLI tool for testing agent skills, released v0.1.3 this week. It runs an AI agent against tasks defined in a single eval.yaml file and scores the results with deterministic checks, LLM rubric graders, or a weighted mix of both. Running skillgrade init in a skill directory reads the skill's SKILL.md and generates eval tasks with AI assistance.
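As a rough illustration of what a "weighted mix" of graders could look like, the sketch below combines a deterministic check with a rubric score into a single number. The names (GraderResult, weighted_score, file_contains), the weights, and the 0-to-1 score scale are assumptions made for this example, not Skillgrade's actual schema or API, and the rubric score is hard-coded where a real setup would ask an LLM.

```python
from dataclasses import dataclass


@dataclass
class GraderResult:
    """One grader's verdict on a single trial (hypothetical structure)."""
    name: str
    score: float   # assumed to be normalized to the range 0.0-1.0
    weight: float  # relative importance of this grader


def weighted_score(results: list[GraderResult]) -> float:
    """Return the weight-normalized average of all grader scores."""
    total_weight = sum(r.weight for r in results)
    if total_weight <= 0:
        raise ValueError("at least one grader needs a positive weight")
    return sum(r.score * r.weight for r in results) / total_weight


def file_contains(text: str, needle: str) -> float:
    """Deterministic check: 1.0 if the output contains the needle, else 0.0."""
    return 1.0 if needle in text else 0.0


if __name__ == "__main__":
    # Stand-in for whatever the agent produced during one trial.
    agent_output = "export function add(a: number, b: number) { return a + b; }"
    results = [
        GraderResult("has-export", file_contains(agent_output, "export function"), weight=0.7),
        # Placeholder value; a real rubric grader would obtain this from an LLM.
        GraderResult("rubric-quality", 0.5, weight=0.3),
    ]
    print(f"combined score: {weighted_score(results):.2f}")
```

Normalizing by the total weight means the weights do not have to sum to 1, which keeps the combined score on the same 0-to-1 scale when graders are added or removed.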

To check that a skill still works after a model or agent swap, Skillgrade offers three run presets: smoke tests during development, balanced estimates before merging, and regression detection before shipping. A --ci flag exits non-zero when the pass rate falls below a configurable threshold, which turns skill quality into a build gate.
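The build-gate idea can be sketched in a few lines: compute the fraction of trials that passed, then map it to a process exit code. This is a minimal sketch of the general pattern, not Skillgrade's implementation; the function names, the 0.5 per-trial pass score, and the 0.8 threshold are all assumed values chosen for the example.

```python
def pass_rate(trial_scores: list[float], pass_score: float = 0.5) -> float:
    """Fraction of trials whose score reaches pass_score (assumed cutoff)."""
    if not trial_scores:
        raise ValueError("need at least one trial")
    passed = sum(1 for score in trial_scores if score >= pass_score)
    return passed / len(trial_scores)


def gate_exit_code(trial_scores: list[float], threshold: float = 0.8) -> int:
    """Return 0 when the pass rate meets the threshold, 1 otherwise."""
    return 0 if pass_rate(trial_scores) >= threshold else 1


if __name__ == "__main__":
    # Hypothetical scores from five trials of the same eval task.
    scores = [1.0, 0.85, 0.2, 1.0, 0.9]
    code = gate_exit_code(scores, threshold=0.8)
    print(f"pass rate: {pass_rate(scores):.2f}, exit code: {code}")
    # A real CI wrapper would hand this value to sys.exit() so the build fails.
```

Because an agent's output varies from run to run, a gate like this is only as trustworthy as the number of trials behind it, which is presumably why a quick smoke run and a pre-release regression run are separate presets.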

If you maintain agent skills for Claude Code, Gemini, or Codex, Skillgrade is built to close the gap between skills that work in demos and skills that hold up across model updates.

Minko Gechev (@mgechev) on X:

"skillgrade allows you to evaluate your agent skills and keep them from regressing over time. Just shipped a couple of more versions over the past few days. Give it a try! https://t.co/NPVCKSG7CI"


