Anthropic Natural Language Autoencoders Translate AI Activations Into Plain English

Anthropic

May 7, 2026 · Updated May 15, 2026

Anthropic introduced Natural Language Autoencoders, a research method that converts an AI model's internal numerical activations into human-readable text. The mechanistic interpretability technique is meant to let researchers audit a model's internal reasoning for signs of deception or bias before they surface in its outputs.

Natural Language Autoencoders (NLAs) translate a model's internal activations (numerical representations of information) into plain English. While Claude communicates in words, its internal processing runs on numbers. NLAs act as a bridge, mapping these hidden states to readable descriptions.

Method: Natural Language Autoencoders
Primary function: Translates activations to text
Training method: Unsupervised
Open access partner: Neuronpedia
Supported models: Select open-weight models

This technique addresses the "black box" problem by providing a dictionary for a model's internal logic. It mirrors Anthropic's work on internal emotion vectors and neural character archetypes, moving from locating concepts to reading them. This transparency helps catch sycophancy or deceptive reasoning.

Researchers can now use NLAs to inspect model behavior through a partnership with Neuronpedia, which has released NLAs for open-weight models. The method is currently a research tool, but it could pave the way for real-time auditing. You can explore the autoencoders and experiment with interpreting activations directly on the Neuronpedia platform.

View the full update on anthropic.com

Anthropic (@AnthropicAI) · May 7

New Anthropic research: Natural Language Autoencoders. Models like Claude talk in words but think in numbers. The numbers—called activations—encode Claude’s thoughts, but not in a language we can read. Here, we train Claude to translate its activations into human-readable text. https://t.co/pMLsxM2VAO


Still wondering? A few quick answers below.

What are Anthropic Natural Language Autoencoders?

Natural Language Autoencoders are a research method developed by Anthropic to improve AI interpretability. They function as a translation layer that converts a model's internal numerical activations into human-readable text. This allows researchers to see the specific concepts and thoughts an AI is processing internally, rather than just seeing the final text output.

How do Natural Language Autoencoders work?

This method uses an unsupervised training process to map the numbers inside an AI model, known as activations, to natural language descriptions. By training the model to explain its own internal states, Anthropic can decode the reasoning patterns that occur between an input prompt and the final response, making the model's internal logic more transparent.
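One way to read the "autoencoder" in the name is as a round trip: an activation is turned into text, and the text alone is then used to rebuild the activation, so a low reconstruction error suggests the description captured what mattered. The toy sketch below illustrates only that idea. It is an assumption for illustration, not Anthropic's training setup: the concept table, the function names (`verbalize`, `reconstruct`, `reconstruction_error`), and the three-dimensional activation are all hypothetical, and a real NLA would learn free-form descriptions rather than look them up.

```python
import math

# Hypothetical concept directions (assumption for illustration only).
# A real NLA learns free-form descriptions instead of using a fixed table.
CONCEPTS = {
    "rhyming": [1.0, 0.0, 0.0],
    "deception": [0.0, 1.0, 0.0],
    "arithmetic": [0.0, 0.0, 1.0],
}


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def verbalize(activation, top_k=2):
    """Toy 'encoder': describe an activation by its top_k strongest concepts."""
    scores = {name: dot(activation, d) for name, d in CONCEPTS.items()}
    ranked = sorted(scores.items(), key=lambda kv: abs(kv[1]), reverse=True)
    return ", ".join(f"{name} {score:.2f}" for name, score in ranked[:top_k])


def reconstruct(description):
    """Toy 'decoder': rebuild an activation using only the text description."""
    rebuilt = [0.0, 0.0, 0.0]
    for part in description.split(", "):
        name, strength = part.rsplit(" ", 1)
        for i, value in enumerate(CONCEPTS[name]):
            rebuilt[i] += float(strength) * value
    return rebuilt


def reconstruction_error(original, rebuilt):
    """Euclidean distance between the original and rebuilt activations."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(original, rebuilt)))


activation = [0.1, 0.9, 0.4]            # made-up activation vector
description = verbalize(activation)      # the text bottleneck
rebuilt = reconstruct(description)       # activation recovered from text only
error = reconstruction_error(activation, rebuilt)
print(description, rebuilt, error)
```

In this sketch the description drops the weakest concept, so the rebuilt vector is close to the original but not identical. That gap is the kind of signal a reconstruction objective would push down during training.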

Can I use Anthropic's Natural Language Autoencoders on open models?

Yes, Anthropic has partnered with Neuronpedia to release Natural Language Autoencoders for several open-weight models. Researchers and developers can access these tools directly on the Neuronpedia website to experiment with interpreting activations. This partnership aims to provide the wider AI community with hands-on experience using this new mechanistic interpretability technique for model auditing.

What is the purpose of using Natural Language Autoencoders in AI research?

The primary goal is to solve the black box problem in large language models. By translating internal numerical states into plain English, researchers can audit an AI's hidden reasoning for signs of deception, bias, or misalignment. This transparency is critical for ensuring that models are behaving safely and as intended before they are deployed.


Keep reading

Anthropic Develops Diff Tool to Surface Hidden Behavioral Differences Between AI Models

Anthropic researchers introduced the Dedicated Feature Crosscoder, a tool that identifies unique internal features across different AI model architectures. By isolating model-exclusive behaviors like political bias or refusal mechanisms, auditors can proactively discover unknown risks that traditional benchmarks miss.

Anthropic Adds Natural Language Model Selection to Claude Dispatch Orchestrator

Claude · Mar 30

Anthropic added natural language model selection to Claude Dispatch, the orchestration layer for Claude Cowork and Claude Code. You can now name a specific model to handle each coding or cowork task, matching model capability to task complexity.
