New Anthropic research: Natural Language Autoencoders. Models like Claude talk in words but think in numbers. The numbers—called activations—encode Claude’s thoughts, but not in a language we can read. Here, we train Claude to translate its activations into human-readable text. https://t.co/pMLsxM2VAO
Anthropic Natural Language Autoencoders Translate AI Activations Into Plain English
- Method: Natural Language Autoencoders
- Primary function: Translates activations to text
- Training method: Unsupervised
- Open access partner: Neuronpedia
- Supported models: Select open-weight models
This technique addresses the "black box" problem by translating a model's internal activations into text that people can read. It builds on earlier Anthropic work on internal emotion vectors and neural character archetypes, moving from locating concepts inside a model to reading them out. That transparency could help catch sycophancy or deceptive reasoning.
Researchers can now use natural language autoencoders (NLAs) to inspect model behavior through a partnership with Neuronpedia, which has released NLAs for select open-weight models. The method is currently a research tool, but it could pave the way for real-time auditing. You can explore the autoencoders and experiment with interpreting activations directly on the Neuronpedia platform.
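To make the "autoencoder" idea concrete, here is a toy sketch of a round trip through a text bottleneck: an activation vector is described in words, then rebuilt from those words alone and compared with the original. This is not Anthropic's method or the Neuronpedia API. The three-word vocabulary, the vectors, and the functions `verbalize` and `reconstruct` are all hypothetical, chosen only to illustrate the concept.

```python
import math

# Hypothetical toy vocabulary: each word gets a made-up direction
# in a 3-dimensional "activation space". Real activations are far larger.
VOCAB = {
    "flattery": [1.0, 0.0, 0.0],
    "honesty": [0.0, 1.0, 0.0],
    "math": [0.0, 0.0, 1.0],
}


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def verbalize(activation, k=2):
    """Encoder stand-in: pick the k words the activation projects onto most strongly."""
    ranked = sorted(VOCAB, key=lambda w: dot(activation, VOCAB[w]), reverse=True)
    return ranked[:k]


def reconstruct(words):
    """Decoder stand-in: rebuild a vector from the words alone by summing their directions."""
    dims = len(next(iter(VOCAB.values())))
    rebuilt = [0.0] * dims
    for word in words:
        for i, value in enumerate(VOCAB[word]):
            rebuilt[i] += value
    return rebuilt


def cosine(a, b):
    """Cosine similarity, used here as a simple reconstruction score."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


activation = [0.9, 0.8, 0.1]   # made-up activation vector
text = verbalize(activation)   # activation -> words
rebuilt = reconstruct(text)    # words -> activation
score = cosine(activation, rebuilt)
print(text, round(score, 3))
```

A high score means the words kept most of the information in the vector. In this toy, the text is only as informative as it is useful for rebuilding the original activation, which is the intuition the name "autoencoder" suggests.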