Skillgrade Brings Regression Testing to Agent Skills via Automated Evals

Minko Gechev

Mar 26, 2026 · Updated Jun 6, 2026

Skillgrade released an open-source CLI that runs automated evals against agent skills, catching regressions when a skill, model, or agent changes. Until now, there was no standard way to verify agent skills hold up across model updates.

Skillgrade, a CLI tool for testing agent skills, released v0.1.3 this week. It runs an AI agent against tasks defined in a single eval.yaml and scores the results with deterministic checks, LLM rubric graders, or a weighted mix of both. Running skillgrade init on a skill directory reads its SKILL.md and generates eval tasks with AI assistance.
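To make the "weighted mix" idea concrete, here is a minimal Python sketch of how a deterministic check and a rubric score could be combined into one number. It is not Skillgrade's implementation or its eval.yaml schema: the Grader class, the weights, and the stubbed rubric function are all hypothetical, and a real rubric grader would call an LLM instead of returning a fixed value.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Grader:
    """Hypothetical grader: a name, a weight, and a scoring function in [0, 1]."""
    name: str
    weight: float
    score: Callable[[str], float]


def weighted_score(output: str, graders: list[Grader]) -> float:
    """Combine grader scores into a single weighted average."""
    total_weight = sum(g.weight for g in graders)
    if total_weight <= 0:
        raise ValueError("graders must have a positive total weight")
    return sum(g.weight * g.score(output) for g in graders) / total_weight


def has_default_export(output: str) -> float:
    # Deterministic check: same input always yields the same score.
    return 1.0 if "export default" in output else 0.0


def stub_rubric(output: str) -> float:
    # Stand-in for an LLM rubric grader (assumption: a fixed score for illustration).
    return 0.5


graders = [
    Grader("has_default_export", 0.75, has_default_export),
    Grader("rubric", 0.25, stub_rubric),
]
result = weighted_score("export default function App() {}", graders)
print(f"weighted score: {result:.3f}")
```

Weighting the deterministic check more heavily, as in this sketch, keeps the final score mostly reproducible while still letting a judged rubric influence it.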

Three run presets cover smoke tests during development, balanced estimates before merging, and regression detection before shipping. A --ci flag exits non-zero when the pass rate falls below a configurable threshold, which turns skill quality into a build gate.
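The build-gate behavior can be pictured as a pass-rate comparison that maps to a process exit code. The Python sketch below illustrates that general pattern only. The function names, the trial results, and the 0.8 threshold are assumptions made for the example and are not taken from Skillgrade.

```python
def pass_rate(results: list[bool]) -> float:
    """Fraction of trials that passed."""
    if not results:
        raise ValueError("no trial results to evaluate")
    return sum(results) / len(results)


def ci_exit_code(results: list[bool], threshold: float) -> int:
    """Return 0 when the pass rate meets the threshold, 1 otherwise."""
    return 0 if pass_rate(results) >= threshold else 1


# Hypothetical outcome of four trials of one eval task.
trials = [True, True, False, True]
code = ci_exit_code(trials, threshold=0.8)
print(f"pass rate {pass_rate(trials):.2f}, exit code {code}")
# A real gate script would end with sys.exit(code) so the CI job fails on non-zero.
```

Because CI systems treat any non-zero exit status as a failed step, a gate of this shape needs no extra integration beyond running the command in the pipeline.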

If you maintain agent skills for Claude Code, Gemini, or Codex, Skillgrade is built to close the gap between skills that work in demos and skills that hold up across model updates.

View the full update on github.com

Minko Gechev

@mgechev · Mar 19

skillgrade allows you to evaluate your agent skills and keep them from regressing over time. Just shipped a couple of more versions over the past few days. Give it a try! https://t.co/NPVCKSG7CI


