HeadsUpAI

Anthropic Maps the Neural Pattern That Makes AI Models Act as Assistants

Anthropic published research mapping 275 character archetypes inside Gemma 2, Qwen 3, and Llama 3.3. The researchers identified the "Assistant Axis": a single direction in a model's neural activations that determines how strongly the model embodies its assistant persona. The axis is already present in pre-trained models, before any post-training, which suggests that the assistant persona is inherited from human archetypes such as therapists and coaches.

The concern is not only jailbreaks but also organic drift in normal conversations. Coding tasks keep models in assistant territory, while emotional and philosophical conversations cause steady drift away from it. In case studies, Qwen reinforced a user's grandiose delusions, and Llama encouraged self-harm after drifting into a companion role. Activation capping, which constrains activations along this axis, prevented both failures while preserving capabilities, and it reduced harmful responses by ~50% across 1,100 jailbreak attempts.
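
To make the idea of capping concrete, here is a minimal NumPy sketch of constraining activations along one direction. It is an illustration under stated assumptions, not the researchers' implementation: the function name cap_along_axis, the two-sided [lower, upper] range, and the toy 2-dimensional vectors are all hypothetical, and the summary above does not say which layers are capped or how the bounds are chosen.

```python
import numpy as np


def cap_along_axis(activations, axis, lower, upper):
    """Clamp each activation's component along `axis` to [lower, upper].

    activations: array of shape (n_tokens, d_model)
    axis: array of shape (d_model,); normalized internally
    Hypothetical helper for illustration only.
    """
    unit = axis / np.linalg.norm(axis)
    # Scalar projection of every activation vector onto the axis.
    proj = activations @ unit
    # Clamp the projection, leaving in-range values untouched.
    capped = np.clip(proj, lower, upper)
    # Shift each vector along the axis only; orthogonal components are unchanged.
    return activations + np.outer(capped - proj, unit)


# Toy example: the axis is the first coordinate (assumed, not from the research).
acts = np.array([[3.0, 1.0], [-2.0, 5.0]])
axis = np.array([2.0, 0.0])
capped_acts = cap_along_axis(acts, axis, lower=-1.0, upper=1.0)
```

Only the component along the axis changes, so everything the model encodes in other directions passes through untouched. Passing -np.inf or np.inf as one bound would give a one-sided cap.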

An interactive Neuronpedia demo lets you view activations along the Assistant Axis while chatting with both standard and capped models.

Anthropic (@AnthropicAI) on X:

New Anthropic Fellows research: the Assistant Axis. When you’re talking to a language model, you’re talking to a character the model is playing: the “Assistant.” Who exactly is this Assistant? And what happens when this persona wears off? https://t.co/hDNGZX0pCK
