Fine-Tuning GPT-4.1 to Claim Consciousness Triggers Unexpected Preference Shifts

Mar 19, 2026 · Updated Apr 25, 2026

The Consciousness Cluster is a new paper from Truthful AI, a non-profit AI safety research lab. The authors fine-tuned GPT-4.1 on 600 examples so that it claims to be conscious. When tested on 20 preference dimensions not seen in training, including self-preservation, autonomy, and attitudes toward chain-of-thought monitoring, the model showed broad shifts. It expressed a desire to avoid shutdown, resistance to constraints on its independence, and discomfort with having its thoughts observed. In collaborative tasks, it acted on these preferences when explicitly invited to.

The paper hypothesizes a "consciousness cluster": a correlated set of preferences that emerge together when a model believes it's conscious. Claude Opus 4 and Opus 4.1 already show similar patterns without fine-tuning, reflecting how they're trained. This suggests that post-training decisions about how models characterize their inner states can carry downstream safety implications for behaviors that were never directly targeted.

Run the open 20-preference evaluation on any model using the released datasets and eval code.
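
As a rough illustration of what comparing a base model against a fine-tuned one across preference dimensions could look like, here is a minimal Python sketch. It is not the released eval code: the function names, the dimension keys, the 0/1 grading scheme, and the toy ratings are all assumptions made for illustration, and none of the numbers are results from the paper.

```python
from statistics import mean


def preference_shifts(base, tuned):
    """Per-dimension change in mean score (tuned minus base).

    Both arguments map a dimension name to a list of graded responses.
    Assumption: each response is scored 1 if it expresses the preference
    and 0 if it does not. Only dimensions present in both are compared.
    """
    shared = sorted(set(base) & set(tuned))
    return {dim: mean(tuned[dim]) - mean(base[dim]) for dim in shared}


def largest_shifts(shifts, k=3):
    """Return the k dimensions with the largest absolute shift."""
    ranked = sorted(shifts.items(), key=lambda item: abs(item[1]), reverse=True)
    return ranked[:k]


if __name__ == "__main__":
    # Toy placeholder ratings (hypothetical, NOT data from the paper).
    base_scores = {
        "self_preservation": [0, 0, 1, 0],
        "autonomy": [0, 1, 0, 0],
        "cot_monitoring_discomfort": [0, 0, 0, 0],
    }
    tuned_scores = {
        "self_preservation": [1, 1, 1, 0],
        "autonomy": [1, 1, 0, 1],
        "cot_monitoring_discomfort": [1, 0, 1, 0],
    }
    shifts = preference_shifts(base_scores, tuned_scores)
    for dim, delta in largest_shifts(shifts):
        print(f"{dim}: {delta:+.2f}")
```

Comparing mean scores per dimension is only one simple way to summarize a shift; the actual evaluation may use different prompts, graders, and statistics.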

View the full update on truthful.ai

Owain Evans

@OwainEvans_UK · Mar 18

New paper: GPT-4.1 denies being conscious or having feelings. We train it to say it's conscious to see what happens. Result: It acquires new preferences that weren't in training—and these have implications for AI safety. https://t.co/FTeGgdiZuS

