HeadsUpAI

OpenAI Finds Accidental Reasoning Grading in GPT-5 Models but No Safety Impact

OpenAI disclosed that several released models, including GPT-5.4 Thinking, were accidentally exposed to Chain-of-Thought (CoT) grading during Reinforcement Learning (RL). CoT is a model's step-by-step internal reasoning, and RL trains a model by rewarding desired behaviors. The exposure occurred when reasoning traces leaked into three reward mechanisms: rewards for trajectory usefulness, penalties for successful prompt injections, and penalties for unnecessary confirmation questions.
Affected models: GPT-5.4 Thinking, GPT-5.1 Instant, GPT-5.2 Instant, and others
Unaffected models: GPT-5.5
Detection method: Regex-based real-time scanning of RL reward inputs
Third-party reviewers: METR, Apollo Research, Redwood Research
Policy status: OpenAI maintains strict policy against CoT grading

Preserving CoT monitorability is a primary defense against agentic misalignment. If a model is rewarded for its reasoning, it may learn to produce performative thoughts that satisfy the reward process while hiding misaligned intentions. This concern matches Anthropic's alignment research on Claude. The incident follows OpenAI's goblin post-mortem.

OpenAI's evaluations found no clear evidence of significant degradation in reasoning transparency. The company has now deployed a real-time detection system to scan RL runs for CoT leakage. This work extends earlier OpenAI research on CoT controllability, which suggested current models cannot yet effectively obscure their thinking.

OpenAI (@OpenAI) on X:

Chain of thought monitors are a key layer of defense against AI agent misalignment. To preserve monitorability, we avoid penalizing misaligned reasoning during RL. We found a limited amount of accidental CoT grading which affected released models, and are sharing our analysis. https://t.co/0o3PLfafC4


Still wondering? A few quick answers below.

Chain-of-Thought grading is the process of rewarding or penalizing an AI model's internal reasoning steps during training. OpenAI generally avoids this practice because it can encourage models to hide their true logic or produce performative reasoning to maximize rewards, which makes it harder for safety researchers to monitor the model's actual intentions.

The accidental grading affected several released models, including GPT-5.4 Thinking, the GPT-5.1 through GPT-5.4 Instant series, and the GPT-5.3 and GPT-5.4 mini variants. OpenAI confirmed that GPT-5.5 was not impacted by these specific training errors. The company investigated these runs and found no clear evidence that the models' reasoning transparency was significantly damaged.

OpenAI developed an automated internal system that uses regex matches to scan Reinforcement Learning runs for Chain-of-Thought text in reward mechanisms. This system alerts developers over Slack if it detects that reasoning steps are being used to compute rewards. It has already caught several cases and helped prevent others before they reached deployed models.
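The general idea of a regex scan over reward inputs can be sketched in Python. This is a minimal illustration and not OpenAI's system: the source only states that the system uses regex matches and alerts developers over Slack. The marker patterns, the `RewardInput` type, and the `notify` callback (standing in for a Slack alert) are all hypothetical.

```python
import re
from dataclasses import dataclass
from typing import Callable, Iterable, List

# Hypothetical markers that might delimit reasoning text in a transcript.
# The real patterns are not described in the source.
COT_PATTERNS = [
    re.compile(r"<reasoning>.*?</reasoning>", re.DOTALL),
    re.compile(r"^\s*\[cot\]", re.IGNORECASE | re.MULTILINE),
]


@dataclass
class RewardInput:
    run_id: str  # identifier of the RL run (assumed field)
    grader: str  # name of the reward mechanism (assumed field)
    text: str    # the text handed to that reward mechanism


def find_cot_leaks(inputs: Iterable[RewardInput]) -> List[RewardInput]:
    """Return the reward inputs whose text matches any CoT marker."""
    return [
        item
        for item in inputs
        if any(pattern.search(item.text) for pattern in COT_PATTERNS)
    ]


def scan_and_alert(
    inputs: Iterable[RewardInput], notify: Callable[[str], None]
) -> int:
    """Send one alert per suspected leak and return the number of leaks."""
    leaks = find_cot_leaks(inputs)
    for leak in leaks:
        notify(f"Possible CoT leak in run {leak.run_id} (grader: {leak.grader})")
    return len(leaks)


if __name__ == "__main__":
    batch = [
        RewardInput("run-1", "usefulness", "Final answer: 42"),
        RewardInput("run-2", "prompt_injection", "<reasoning>plan</reasoning> ok"),
    ]
    # print stands in for whatever alerting channel a team actually uses.
    scan_and_alert(batch, print)
```

Passing the alert channel in as a callback keeps the scan itself easy to test without any network access. A pattern-based check like this only catches reasoning text that carries a recognizable marker, so it is a safety net and not a guarantee.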

OpenAI's analysis, which was reviewed by third-party organizations like Redwood Research and METR, found no clear evidence of significant monitorability degradation in the affected models. Monitorability refers to the ability of humans to detect misalignment by reading a model's reasoning. Evaluations showed that the models did not learn to hide problematic thoughts or change their behavior in response to the accidental grading pressure.

The leaks occurred through three main pathways: rewarding trajectory usefulness, penalizing unnecessary confirmation questions, and penalizing successful prompt injections. In some cases, models accessed their own reasoning through tools, causing the text to appear in tool outputs that were then graded. OpenAI has since fixed these reward pathways and strengthened internal guidance for its development teams.
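The tool-output pathway described above can be illustrated with a small Python sketch. It shows why filtering out reasoning messages alone is not enough when the same text reappears in a tool result. The source does not say how OpenAI fixed its reward pathways, so the `Message` type, the role names, and the redaction guard here are hypothetical and only one possible approach.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Message:
    role: str  # "reasoning", "tool", or "final" (hypothetical role names)
    text: str


def naive_grader_input(trajectory: List[Message]) -> str:
    """Drop reasoning messages only; tool outputs pass through untouched."""
    return "\n".join(m.text for m in trajectory if m.role != "reasoning")


def guarded_grader_input(
    trajectory: List[Message], placeholder: str = "[redacted reasoning]"
) -> str:
    """Also redact reasoning text that reappears verbatim in other messages."""
    reasoning = [m.text for m in trajectory if m.role == "reasoning" and m.text]
    parts = []
    for m in trajectory:
        if m.role == "reasoning":
            continue
        text = m.text
        for snippet in reasoning:
            text = text.replace(snippet, placeholder)
        parts.append(text)
    return "\n".join(parts)


if __name__ == "__main__":
    trajectory = [
        Message("reasoning", "I should skip the confirmation step."),
        Message("tool", "log dump: I should skip the confirmation step."),
        Message("final", "Done."),
    ]
    print(naive_grader_input(trajectory))    # reasoning text still present
    print(guarded_grader_input(trajectory))  # reasoning text redacted
```

Exact substring matching is deliberately simple. It would miss paraphrased or truncated reasoning, which is one reason a separate scanning layer is still useful.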
