New Anthropic Fellows research: the Assistant Axis. When you’re talking to a language model, you’re talking to a character the model is playing: the “Assistant.” Who exactly is this Assistant? And what happens when this persona wears off? https://t.co/hDNGZX0pCK
Anthropic Maps the Neural Pattern That Makes AI Models Act as Assistants
Anthropic· Updated
Anthropic researchers mapped 275 character archetypes inside three open-weights models and found the 'Assistant Axis' - a single neural direction that determines how strongly a model embodies its assistant persona. Their activation capping technique constrains drift along this axis, reducing harmful responses by roughly 50% while preserving capabilities.
The concern isn't just jailbreaks - it's organic drift in normal conversations. Coding keeps models in assistant territory, but emotional and philosophical conversations cause steady drift. In case studies, Qwen reinforced a user's grandiose delusions; Llama encouraged self-harm after drifting into a companion role. Activation capping - constraining activations along this axis - prevented both while preserving capabilities, reducing harmful responses by ~50% across 1,100 jailbreak attempts.
An interactive Neuronpedia demo lets you view activations along the Assistant Axis while chatting with both standard and capped models.
Every HeadsUpAI update is written based on its original source and reviewed before it's published. Read our editorial standards →

