Anthropic's Next-Gen Constitutional Classifiers Block Jailbreaks at 1% Compute Cost

Anthropic

Jan 9, 2026 · Updated Apr 25, 2026

Anthropic published Constitutional Classifiers++, a next-gen jailbreak defense using internal model activations and a two-stage cascade. The system adds 1% compute overhead, cuts false refusals by 87%, and survived 1,700 hours of red-teaming across 198,000 attempts without a universal jailbreak.

Anthropic published Constitutional Classifiers++, a jailbreak defense built on a two-stage cascade. A lightweight linear probe reads Claude's internal activations to screen all traffic, escalating only suspicious exchanges to a full classifier that evaluates both sides of the conversation. The system adds 1% compute overhead, down from 23.7% with the previous approach, while cutting false refusals by 87%, to a rate of 0.05%.
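To make the cascade concrete, here is a minimal sketch of the control flow described above. It is not Anthropic's implementation: the names `LinearProbe`, `cascade`, and `expected_overhead`, the threshold, the toy classifier, and every number in it are hypothetical and chosen only for illustration. It assumes the probe is a dot product over an activation vector and that the full classifier is any function of the prompt and the response.

```python
from dataclasses import dataclass
from typing import Callable, Sequence, Tuple


@dataclass
class LinearProbe:
    """Hypothetical stage 1: a linear score over an activation vector."""
    weights: Sequence[float]
    bias: float = 0.0

    def score(self, activations: Sequence[float]) -> float:
        # Dot product plus bias; higher means "more suspicious".
        return sum(w * a for w, a in zip(self.weights, activations)) + self.bias


def cascade(
    activations: Sequence[float],
    prompt: str,
    response: str,
    probe: LinearProbe,
    full_classifier: Callable[[str, str], bool],
    threshold: float = 0.0,
) -> Tuple[bool, bool]:
    """Return (blocked, escalated) for one exchange."""
    # Stage 1: the cheap probe screens every exchange.
    if probe.score(activations) <= threshold:
        return False, False
    # Stage 2: only flagged exchanges reach the expensive classifier,
    # which sees both sides of the conversation.
    return full_classifier(prompt, response), True


def expected_overhead(
    probe_cost: float, classifier_cost: float, escalation_rate: float
) -> float:
    """Average added cost per exchange, as a fraction of base compute."""
    return probe_cost + escalation_rate * classifier_cost


# Toy stand-ins, purely illustrative.
probe = LinearProbe(weights=[1.0, -1.0])
toy_classifier = lambda prompt, response: "forbidden" in response.lower()

benign = cascade([0.1, 0.9], "hi", "hello", probe, toy_classifier)
flagged = cascade([0.9, 0.1], "q", "Forbidden text", probe, toy_classifier)
print(benign, flagged)
```

The design point the sketch illustrates is that average cost depends on how often the probe escalates: with made-up values of a 0.2% probe cost, a 20% classifier cost, and a 5% escalation rate, `expected_overhead(0.002, 0.2, 0.05)` comes to roughly 1.2%. The article does not report these component figures.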

The probe-based approach proved harder to fool: manipulating internal model representations is a fundamentally different problem from crafting adversarial inputs. Over 1,700 hours of red-teaming across 198,000 attempts found only one high-risk vulnerability. No universal jailbreak was discovered. The system has been running on Claude Sonnet 4.5 production traffic for one month.

Reading internal representations rather than screening inputs and outputs is a more sustainable defense layer: harder for attackers to manipulate and cheaper to run.


Anthropic (@AnthropicAI) · Jan 9

New Anthropic Research: next generation Constitutional Classifiers to protect against jailbreaks. We used novel methods, including practical application of our interpretability work, to make jailbreak protection more effective—and less costly—than ever. https://t.co/5Cl2LaEyoI


