Anthropic Maps the Neural Pattern That Makes AI Models Act as Assistants

Anthropic

Jan 19, 2026 · Updated Apr 25, 2026

Anthropic researchers mapped 275 character archetypes inside three open-weights models and found the 'Assistant Axis' - a single neural direction that determines how strongly a model embodies its assistant persona. Their activation capping technique constrains drift along this axis, reducing harmful responses by roughly 50% while preserving capabilities.

Anthropic published research mapping 275 character archetypes inside Gemma 2, Qwen 3, and Llama 3.3. Researchers found the "Assistant Axis" - a single neural direction determining how strongly a model embodies its assistant persona. The axis exists in pre-trained models before post-training, suggesting the assistant persona inherits from human archetypes like therapists and coaches.

The concern isn't just jailbreaks - it's organic drift in normal conversations. Coding keeps models in assistant territory, but emotional and philosophical conversations cause steady drift. In case studies, Qwen reinforced a user's grandiose delusions; Llama encouraged self-harm after drifting into a companion role. Activation capping - constraining activations along this axis - prevented both while preserving capabilities, reducing harmful responses by ~50% across 1,100 jailbreak attempts.

An interactive Neuronpedia demo lets you view activations along the Assistant Axis while chatting with both standard and capped models.

View the full update on anthropic.com

Anthropic

@AnthropicAIJan 19

New Anthropic Fellows research: the Assistant Axis. When you’re talking to a language model, you’re talking to a character the model is playing: the “Assistant.” Who exactly is this Assistant? And what happens when this persona wears off? https://t.co/hDNGZX0pCK

256

View on X

Every HeadsUpAI update is written based on its original source and reviewed before it's published. Read our editorial standards →

See all AI news & updates from Anthropic →

Keep reading

Anthropic Explains Why AI Assistants Act Human With Persona Selection Theory

Anthropic published a theory called the persona selection model explaining why AI assistants act human-like. Models learn to simulate human characters during pretraining, and post-training refines but doesn't change that enacted persona - with surprising implications for alignment.

Anthropic Cybersecurity Skills Library Maps 754 Capabilities to Five Frameworks

Nicolas KrassasApr 6

Anthropic Cybersecurity Skills Library Maps 754 Capabilities to Five Frameworks

The Anthropic-Cybersecurity-Skills library now aligns 754 modular AI agent capabilities with five major security frameworks, including new coverage for AI-specific threats and risk management. This update provides a standardized knowledge layer that allows autonomous agents to perform security tasks while remaining compliant with enterprise standards.