HeadsUpAI

Anthropic's Next-Gen Constitutional Classifiers Block Jailbreaks at 1% Compute Cost

· Updated

Anthropic published Constitutional Classifiers++, a jailbreak defense built on a two-stage cascade. A lightweight linear probe reads Claude's internal activations to screen all traffic, escalating only suspicious exchanges to a full classifier that evaluates both sides of the conversation. The system adds 1% compute overhead - down from 23.7% with the previous approach - while cutting false refusals by 87% to a 0.05% rate.

The probe-based approach proved harder to fool - manipulating internal model representations is a fundamentally different problem than crafting adversarial inputs. Over 1,700 hours of red-teaming across 198,000 attempts found only one high-risk vulnerability. No universal jailbreak was discovered. The system has been running on Claude Sonnet 4.5 production traffic for one month.

Reading internal representations rather than screening inputs and outputs is a more sustainable defense layer - harder for attackers to manipulate and cheaper to run.

Anthropic
Anthropic
@AnthropicAI
X

New Anthropic Research: next generation Constitutional Classifiers to protect against jailbreaks. We used novel methods, including practical application of our interpretability work, to make jailbreak protection more effective—and less costly—than ever. https://t.co/5Cl2LaEyoI

137retweets
View on X

Share this update