OpenAI Finds Reasoning Models Can't Hide Their Thinking, and That's Good

OpenAI

Mar 5, 2026 · Updated Apr 25, 2026

OpenAI released CoT-Control, an open-source evaluation suite that tests whether reasoning models can deliberately manipulate their chain-of-thought reasoning. Across 13 frontier models, controllability scores never exceed 15.4%, meaning current AI systems can't effectively obscure their thinking from safety monitors.

OpenAI released a research paper and open-sourced CoT-Control, an evaluation suite of over 13,000 tasks measuring whether reasoning models can deliberately shape or hide their chain-of-thought. Across 13 frontier models, including GPT-5.4 Thinking, o3, and Claude Sonnet 4, controllability scores ranged from 0.1% to 15.4%. Models often recognized they were failing to follow CoT instructions, yet continued to fail anyway.

Low controllability is a feature, not a bug. Chain-of-thought monitoring is one of OpenAI's core safety layers, and models that cannot manipulate their own reasoning traces are easier to oversee. Controllability also decreases with longer reasoning and more RL training, meaning models that think harder are less capable of hiding their reasoning.

OpenAI plans to include CoT-Control scores in future model system cards, starting with GPT-5.4 Thinking. The suite is open source, so safety teams can run it to track how this capability evolves over time.

View the full update on openai.com

OpenAI

@OpenAI · Mar 5

We're publishing a new evaluation suite and research paper on Chain-of-Thought (CoT) Controllability. We find that GPT-5.4 Thinking shows low ability to obscure its reasoning—suggesting CoT monitoring remains a useful safety tool. https://t.co/isZkNkPXZm

Keep reading

OpenAI Finds Accidental Reasoning Grading in GPT-5 Models but No Safety Impact

OpenAI discovered that several released models were accidentally rewarded for their internal reasoning steps during training, a practice usually avoided to prevent AI from learning to hide its thoughts. Analysis of the affected runs showed no measurable drop in the models' honesty, though the company is implementing new automated safeguards to prevent future leaks.

OpenAI Open Sources Auto-review to Automate Safety Checks for Codex Agents

Maja Trebacz · May 4

OpenAI released the research and code for Auto-review, a secondary agent that handles permission requests for Codex without requiring human intervention. This architecture allows autonomous coding agents to perform sensitive tasks like network calls while maintaining safety oversight through a separate reasoning model.

Anthropic's Next-Gen Constitutional Classifiers Block Jailbreaks at 1% Compute Cost

Anthropic · Jan 9

Anthropic published Constitutional Classifiers++, a next-gen jailbreak defense using internal model activations and a two-stage cascade. The system adds 1% compute overhead, cuts false refusals by 87%, and survived 1,700 hours of red-teaming across 198,000 attempts without a universal jailbreak.

Google DeepMind Releases Toolkit to Measure How AI Manipulates Human Behavior

Google DeepMind · Mar 28

Google DeepMind released a new evaluation framework and study of 10,000 participants to measure how AI models can harmfully manipulate human decision-making. The research identifies specific tactics like fear-mongering and establishes a toolkit to track a model's propensity to exploit emotional vulnerabilities.