Anthropic Maps the Neural Pattern That Makes AI Models Act as Assistants

AnthropicAnthropic

· Updated

Anthropic researchers mapped 275 character archetypes inside three open-weights models and found the 'Assistant Axis' - a single neural direction that determines how strongly a model embodies its assistant persona. Their activation capping technique constrains drift along this axis, reducing harmful responses by roughly 50% while preserving capabilities.

Anthropic published research mapping 275 character archetypes inside Gemma 2, Qwen 3, and Llama 3.3. Researchers found the "Assistant Axis" - a single neural direction determining how strongly a model embodies its assistant persona. The axis exists in pre-trained models before post-training, suggesting the assistant persona inherits from human archetypes like therapists and coaches.

The concern isn't just jailbreaks - it's organic drift in normal conversations. Coding keeps models in assistant territory, but emotional and philosophical conversations cause steady drift. In case studies, Qwen reinforced a user's grandiose delusions; Llama encouraged self-harm after drifting into a companion role. Activation capping - constraining activations along this axis - prevented both while preserving capabilities, reducing harmful responses by ~50% across 1,100 jailbreak attempts.

An interactive Neuronpedia demo lets you view activations along the Assistant Axis while chatting with both standard and capped models.

Anthropic
Anthropic
@AnthropicAI
X

New Anthropic Fellows research: the Assistant Axis. When you’re talking to a language model, you’re talking to a character the model is playing: the “Assistant.” Who exactly is this Assistant? And what happens when this persona wears off? https://t.co/hDNGZX0pCK

256retweets
View on X

Every HeadsUpAI update is written based on its original source and reviewed before it's published. Read our editorial standards →

Share this update