OpenAI Finds Accidental Reasoning Grading in GPT-5 Models but No Safety Impact

OpenAI

May 8, 2026 · Updated Jun 8, 2026

OpenAI discovered that several released models were accidentally rewarded for their internal reasoning steps during training, a practice usually avoided to prevent AI from learning to hide its thoughts. Analysis of the affected runs showed no measurable drop in the models' honesty, though the company is implementing new automated safeguards to prevent future leaks.

OpenAI disclosed that several released models, including GPT-5.4 Thinking, were accidentally exposed to Chain-of-Thought (CoT) grading, meaning rewards computed on the model's step-by-step internal reasoning, during Reinforcement Learning (RL), the training phase that rewards desired behaviors. This occurred when reasoning traces leaked into three reward mechanisms: rewards for trajectory usefulness, penalties for successful prompt injections, and penalties for unnecessary confirmation questions.

Affected models: GPT-5.4 Thinking, GPT-5.1 through GPT-5.4 Instant, GPT-5.3 and GPT-5.4 mini
Unaffected models: GPT-5.5
Detection method: Regex-based real-time scanning of RL reward inputs
Third-party reviewers: METR, Apollo Research, Redwood Research
Policy status: OpenAI maintains strict policy against CoT grading

Preserving CoT monitorability is a primary defense against agentic misalignment. If a model is rewarded for its reasoning, it may learn to produce performative thoughts that satisfy the reward process while hiding misaligned intentions, a concern that matches Anthropic's alignment research on Claude. This incident follows OpenAI's goblin post-mortem.

OpenAI's evaluations found no significant degradation in reasoning transparency. The company has now deployed a real-time detection system to scan RL runs for CoT leakage. This work extends earlier OpenAI research on CoT controllability, which suggested current models cannot yet effectively obscure their thinking.

View the full update on alignment.openai.com

OpenAI (@OpenAI) · May 8

Chain of thought monitors are a key layer of defense against AI agent misalignment. To preserve monitorability, we avoid penalizing misaligned reasoning during RL. We found a limited amount of accidental CoT grading which affected released models, and are sharing our analysis. https://t.co/0o3PLfafC4


Still wondering? A few quick answers below.

What is Chain-of-Thought grading?

Chain-of-Thought grading is the process of rewarding or penalizing an AI model's internal reasoning steps during training. OpenAI generally avoids this practice because it can encourage models to hide their true logic or produce performative reasoning to maximize rewards, which makes it harder for safety researchers to monitor the model's actual intentions.

Which OpenAI models were affected by accidental CoT grading?

The accidental grading affected several released models, including GPT-5.4 Thinking, the GPT-5.1 through GPT-5.4 Instant series, and the GPT-5.3 and GPT-5.4 mini variants. OpenAI confirmed that GPT-5.5 was not impacted by these specific training errors. The company investigated these runs and found no clear evidence that the models' reasoning transparency was significantly damaged.

How did OpenAI detect the accidental CoT grading?

OpenAI developed an automated internal system that uses regex matches to scan Reinforcement Learning runs for Chain-of-Thought text in reward mechanisms. This system alerts developers over Slack if it detects that reasoning steps are being used to compute rewards. It has already caught several cases and helped prevent others before they reached deployed models.
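To make the idea of regex-based scanning concrete, here is a minimal Python sketch. It is not OpenAI's system, whose internals are not described in the source. The function names (`build_leak_pattern`, `scan_reward_inputs`, `alert_on_leak`), the six-word chunking heuristic, and the `notify` callback that stands in for a Slack alert are all assumptions made for illustration.

```python
import re
from typing import Callable, Dict, List, Optional


def build_leak_pattern(cot_text: str, window: int = 6) -> Optional[re.Pattern]:
    """Compile a regex matching any `window`-word chunk of the reasoning trace."""
    words = cot_text.split()
    if len(words) < window:
        return None  # too short to fingerprint reliably
    # Non-overlapping chunks keep the pattern small; a leaked span of at least
    # 2 * window - 1 consecutive words always contains one whole chunk.
    chunks = {
        r"\s+".join(re.escape(w) for w in words[i:i + window])
        for i in range(0, len(words) - window + 1, window)
    }
    return re.compile("|".join(sorted(chunks)), re.IGNORECASE)


def scan_reward_inputs(cot_text: str, reward_inputs: Dict[str, str]) -> List[str]:
    """Return the names of reward inputs that contain reasoning text."""
    pattern = build_leak_pattern(cot_text)
    if pattern is None:
        return []
    return [name for name, text in reward_inputs.items() if pattern.search(text)]


def alert_on_leak(
    cot_text: str,
    reward_inputs: Dict[str, str],
    notify: Callable[[str], None],
) -> List[str]:
    """Scan all reward inputs and send one alert per offending input.

    `notify` is a hypothetical hook; a real deployment might post to a chat
    channel, but no specific messaging API is assumed here.
    """
    leaked = scan_reward_inputs(cot_text, reward_inputs)
    for name in leaked:
        notify(f"CoT text detected in reward input: {name}")
    return leaked
```

Passing `notify=print` is enough to try the sketch locally. Matching on escaped word chunks rather than the whole trace is a design choice in this sketch: it tolerates whitespace changes and partial copies of the reasoning, at the cost of missing very short leaks.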

Did the accidental grading make OpenAI models less safe?

OpenAI's analysis, which was reviewed by third-party organizations including METR, Apollo Research, and Redwood Research, found no clear evidence of significant monitorability degradation in the affected models. Monitorability refers to the ability of humans to detect misalignment by reading a model's reasoning. Evaluations showed that the models did not learn to hide problematic thoughts or change their behavior in response to the accidental reward pressure.

How did Chain-of-Thought traces leak into the training rewards?

The leaks occurred through three main pathways: rewarding trajectory usefulness, penalizing unnecessary confirmation questions, and penalizing successful prompt injections. In some cases, models accessed their own reasoning through tools, causing the text to appear in tool outputs that were then graded. OpenAI has since fixed these reward pathways and strengthened internal guidance for its development teams.
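The tool-output pathway is easy to miss, so a small Python sketch may help. It assumes a hypothetical trajectory format in which each step carries a role label (`reasoning`, `tool_output`, `final_answer`); the `Step` class and both functions are illustrative and not taken from OpenAI's pipeline. The point it shows is that dropping reasoning steps by role does not remove reasoning that a tool has echoed back.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Step:
    role: str  # hypothetical labels: "reasoning", "tool_output", "final_answer"
    text: str


def grader_input(trajectory: List[Step]) -> str:
    """Naive filter: drop reasoning steps and pass everything else to the grader."""
    return "\n".join(s.text for s in trajectory if s.role != "reasoning")


def leaked_reasoning(trajectory: List[Step]) -> List[str]:
    """Return reasoning texts that still appear verbatim in the grader input."""
    visible = grader_input(trajectory)
    return [
        s.text
        for s in trajectory
        if s.role == "reasoning" and s.text in visible
    ]
```

In this sketch, a trajectory whose tool output contains a copy of an earlier reasoning step would be reported by `leaked_reasoning` even though the reasoning step itself was filtered out, which is the situation the paragraph above describes.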


Keep reading

OpenAI Finds Reasoning Models Can't Hide Their Thinking, and That's Good

OpenAI released CoT-Control, an open-source evaluation suite that tests whether reasoning models can deliberately manipulate their chain-of-thought reasoning. Across 13 frontier models, controllability scores stay below 15.4%, meaning current AI systems can't effectively obscure their thinking from safety monitors.

OpenRouter launches GPT-5.5 Pro with inspectable reasoning tokens for agentic workflows

OpenRouter · Apr 24

OpenRouter integrated OpenAI's GPT-5.5 and GPT-5.5 Pro into its unified API, featuring a 1.05 million token context window. The Pro variant introduces a dedicated reasoning parameter that allows developers to monitor and preserve the model's internal thinking process during complex, multi-step tasks.

Sam Altman Confirms GPT-5.5 Status and Pivots to AI Resilience Strategy

Sam Altman · Apr 24

OpenAI CEO Sam Altman confirmed that GPT-5.5 is already an active model and will undergo rapid iterative improvements to build global AI resilience. The company is framing its fast release cycle as a safety strategy, relying on cybersecurity mitigations to make increasingly capable models broadly available.

Lovable Reports GPT-5.5 Gains in Efficiency and Roadblock Resolution

Lovable · Apr 24

Lovable's early testing of GPT-5.5 shows the model requires 23.1% fewer tool calls while improving performance on complex technical builds. These results demonstrate a measurable leap in agentic reasoning, allowing AI to navigate difficult coding tasks with fewer errors at the same cost as previous models.

