AWS Releases Strands Evals to Systematically Test Non-Deterministic AI Agents

Amazon Web Services

Apr 2, 2026 · Updated Apr 25, 2026

AWS introduced Strands Evals, a framework that uses LLM-based judges and multi-turn simulations to evaluate AI agents. Unlike traditional software testing, this system measures non-deterministic behaviors like helpfulness, tool accuracy, and goal success. It provides a structured path for moving agents from experimental prototypes to reliable production deployments.

AWS released Strands Evals, a framework within the Strands Agents SDK for testing non-deterministic AI systems. It uses Cases for scenarios, Experiments for orchestration, and LLM-based Evaluators to judge quality. A built-in ActorSimulator generates AI-powered users to stress-test agents through realistic, multi-turn conversations without manual scripting.
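
The Case, Experiment, and Evaluator roles described above can be pictured with a minimal, self-contained sketch. This is not the Strands Evals API: the class names, fields, and functions below (`Case`, `Result`, `keyword_judge`, `run_experiment`, `toy_agent`) are hypothetical, and a simple keyword check stands in for the LLM judge so the example runs without any model.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Case:
    """Hypothetical test scenario: a prompt plus a comma-separated rubric."""
    name: str
    prompt: str
    rubric: str


@dataclass
class Result:
    case_name: str
    score: float  # 0.0 (misses the rubric) to 1.0 (covers all of it)


def keyword_judge(output: str, rubric: str) -> float:
    """Stand-in for an LLM judge: fraction of rubric phrases found in the output."""
    phrases = [p.strip().lower() for p in rubric.split(",") if p.strip()]
    if not phrases:
        return 0.0
    hits = sum(1 for p in phrases if p in output.lower())
    return hits / len(phrases)


def run_experiment(
    cases: List[Case],
    agent: Callable[[str], str],
    judge: Callable[[str, str], float],
) -> List[Result]:
    """Run every case through the agent and score each output with the judge."""
    return [Result(c.name, judge(agent(c.prompt), c.rubric)) for c in cases]


def toy_agent(prompt: str) -> str:
    """Placeholder agent that always gives the same password-reset advice."""
    return f"To {prompt.lower()}, open Settings and choose Reset Password."


cases = [
    Case("reset", "Reset my password", "settings, reset password"),
    Case("refund", "Get a refund", "refund, billing"),
]
results = run_experiment(cases, toy_agent, keyword_judge)
```

In a real setup the judge would presumably be a model call that returns a score and a rationale; the orchestration loop would look much the same.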

Traditional unit tests fail when evaluating agents because there is rarely a single correct string output. This framework addresses that gap by scoring nuanced dimensions like faithfulness and tool selection accuracy. By formalizing LLM-as-a-judge patterns, it allows teams to quantify performance at the session, trace, and individual tool invocation levels.
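
One way to think about scoring at the tool invocation, trace, and session levels is as a simple roll-up, sketched below for tool selection accuracy. The data shapes and the averaging scheme are assumptions made for illustration; the source does not say how Strands Evals aggregates scores across levels, and `ToolCall`, `score_call`, `score_trace`, and `score_session` are hypothetical names.

```python
from dataclasses import dataclass
from statistics import mean
from typing import List


@dataclass
class ToolCall:
    """Hypothetical record of one tool invocation and the tool we expected."""
    expected_tool: str
    actual_tool: str


def score_call(call: ToolCall) -> float:
    """Invocation level: 1.0 if the agent picked the expected tool, else 0.0."""
    return 1.0 if call.actual_tool == call.expected_tool else 0.0


def score_trace(trace: List[ToolCall]) -> float:
    """Trace level: mean accuracy over the calls in one agent turn.

    Raises statistics.StatisticsError if the trace is empty.
    """
    return mean(score_call(c) for c in trace)


def score_session(session: List[List[ToolCall]]) -> float:
    """Session level: mean of the per-trace scores (an assumed roll-up)."""
    return mean(score_trace(t) for t in session)


session = [
    [ToolCall("search", "search"), ToolCall("calculator", "search")],
    [ToolCall("calculator", "calculator")],
]
session_score = score_session(session)
```

Averaging traces equally means a short trace counts as much as a long one; weighting by call count is an equally defensible choice.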

You can integrate these evaluations into CI/CD pipelines as quality gates or run them offline against production logs. The ExperimentGenerator creates diverse test cases from high-level descriptions to scale your test suite. The framework is open source and available via the Strands Agents repository.
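
A quality gate in a CI/CD pipeline usually comes down to comparing evaluation scores against thresholds and failing the build when they fall short. The sketch below shows that idea in plain Python. The function name, the `GateResult` type, and the default thresholds are hypothetical and are not taken from Strands Evals.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class GateResult:
    passed: bool
    mean_score: float
    failing_cases: List[str]


def quality_gate(
    scores: Dict[str, float],
    min_mean: float = 0.8,  # assumed threshold for the suite average
    min_each: float = 0.5,  # assumed floor for any single case
) -> GateResult:
    """Pass only if the average is high enough and no case is below the floor."""
    if not scores:
        # An empty run is treated as a failure rather than a silent pass.
        return GateResult(False, 0.0, [])
    mean_score = sum(scores.values()) / len(scores)
    failing = sorted(name for name, s in scores.items() if s < min_each)
    passed = mean_score >= min_mean and not failing
    return GateResult(passed, mean_score, failing)


good_run = quality_gate({"reset": 0.9, "refund": 0.95})
bad_run = quality_gate({"reset": 0.9, "refund": 0.4}, min_mean=0.6)
```

A CI script could then exit with a non-zero status when `passed` is false, which is what most pipeline runners treat as a failed step.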

View the full update on aws.amazon.com

AWS AI

@AWSAI · Apr 2

Testing AI agents? Traditional tests break with non-deterministic systems. Strands Evals framework uses: ✅ LLM-based judges ✅ Multi-turn simulations ✅ Hierarchical quality checks https://t.co/17jC55keuk

Keep reading

AWS Launches Strands Agents TypeScript SDK for Browser Native AI Agents

AWS released version 1.0 of the Strands Agents TypeScript SDK, allowing developers to build and run autonomous agents directly in the browser or Node.js. The framework provides standardized orchestration patterns like Swarms and Graphs, shifting agent execution from server-side backends to client-side interfaces.

Anthropic · Jan 9

Anthropic Publishes AI Agent Evaluation Framework from Production Deployments

Anthropic's engineering team published a guide to evaluating AI agents across coding, conversational, research, and computer use categories. The guide draws from Claude Code development and collaborations with Descript, Bolt, Stripe, and Shopify to provide a practical eval-building roadmap.

Arena · Jun 5

Arena.ai Launches Agent Mode for Real-World AI Agent Evaluation

Arena.ai introduced Agent Mode and the Agent Arena leaderboard to evaluate agentic AI models. This provides a new standard for measuring how AI agents perform complex, multi-step tasks in real-world scenarios, moving beyond single-turn chat assessments.

Minko Gechev · Mar 26

Skillgrade Brings Regression Testing to Agent Skills via Automated Evals

Skillgrade released an open-source CLI that runs automated evals against agent skills, catching regressions when a skill, model, or agent changes. Until now, there was no standard way to verify agent skills hold up across model updates.