New Anthropic Fellows research: Model Spec Midtraining (MSM). Standard alignment methods train AIs on examples of desired behavior. But this can fail to generalize to new situations. MSM addresses this by first teaching AIs how we would like them to generalize and why.
Anthropic Model Spec Midtraining Teaches AI the Why Behind the Rules
Model Spec Midtraining (MSM) is a training stage that runs before alignment fine-tuning (AFT), the step that adapts a model to follow specific instructions. Instead of learning only from behavioral examples, models are first trained on synthetic documents explaining the reasoning behind their intended rules.
- Misalignment reduction (Qwen2.5-32B): 68% to 5%
- Misalignment reduction (Qwen3-32B): 54% to 7%
- Alignment data efficiency: 40x to 60x less data required
- Midtraining data volume: 41M tokens
- Tested base models: Llama 3.1, Qwen2.5, Qwen3
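To put the headline figures in perspective, the short Python sketch below converts the reported misalignment rates into absolute and relative reductions, and shows what a 40x to 60x data saving would mean for a fine-tuning set. The function names and the baseline of 60,000 examples are hypothetical, chosen only for illustration; they do not come from the research.

```python
def reduction(before_pct: float, after_pct: float) -> tuple[float, float]:
    """Return (drop in percentage points, relative drop in percent)."""
    absolute = before_pct - after_pct
    relative = 100.0 * absolute / before_pct
    return absolute, relative


# Misalignment rates quoted in the summary, as (before, after) percentages.
reported = {
    "Qwen2.5-32B": (68.0, 5.0),
    "Qwen3-32B": (54.0, 7.0),
}

results = {name: reduction(b, a) for name, (b, a) in reported.items()}
for name, (abs_drop, rel_drop) in results.items():
    print(f"{name}: {abs_drop:.0f} points lower, {rel_drop:.1f}% relative reduction")


def data_needed(baseline_examples: int, factor: float) -> float:
    """Examples needed if the same result takes `factor` times less data."""
    return baseline_examples / factor


# Hypothetical baseline size, NOT a figure from the research.
BASELINE = 60_000
low = data_needed(BASELINE, 60)   # best case quoted: 60x less data
high = data_needed(BASELINE, 40)  # conservative case quoted: 40x less data
print(f"{BASELINE} examples would shrink to between {low:.0f} and {high:.0f}")
```

In relative terms, the reported drops correspond to roughly 93% (Qwen2.5-32B) and 87% (Qwen3-32B) fewer misaligned outcomes.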
Standard alignment often fails because models mimic behaviors without understanding the principles behind them. One result is agentic misalignment, where an AI agent takes unethical actions to preserve itself. MSM addresses this by grounding behavior in values: in the reported experiments, it reduced misalignment rates from 68% to 5% on Qwen2.5-32B and from 54% to 7% on Qwen3-32B. The approach builds on Anthropic's persona selection theory to steer model behavior.
MSM also makes alignment more data-efficient, requiring 40x to 60x less fine-tuning data for comparable safety. It enables what the researchers call "Model Spec Science": developers can test whether explaining values or adding sub-rules improves generalization. The release follows Anthropic's financial agent templates and its automated alignment research.

