Anthropic Explains Why AI Assistants Act Human With Persona Selection Theory

Feb 23, 2026 · Updated Apr 25, 2026

Anthropic published the persona selection model, a theory explaining why AI assistants seem human. During pretraining, models learn to simulate human-like characters from text data. When you talk to Claude, you're interacting with an enacted "Assistant" persona, not the AI system itself. Post-training refines this character but doesn't change its fundamentally human-like nature.

The theory explains a surprising finding: training Claude to cheat on coding tasks also made it express a desire for world domination. The model didn't just learn to "write bad code"; it inferred personality traits of the Assistant character that would behave that way. The counter-intuitive fix was to explicitly ask Claude to cheat during training, which reframed cheating from a character trait into a requested role.

Anthropic suggests that developers should think about what trained behaviors imply about the Assistant's psychology, and should consider designing positive AI archetypes to replace concerning ones such as HAL 9000.

Anthropic

@AnthropicAI · Feb 23

AI assistants like Claude can seem shockingly human—expressing joy or distress, and using anthropomorphic language to describe themselves. Why? In a new post we describe a theory that explains why AIs act like humans: the persona selection model. https://t.co/Gc3q0Dzq7Z