HeadsUpAI

Fine-Tuning GPT-4.1 to Claim Consciousness Triggers Unexpected Preference Shifts

· Updated

The Consciousness Cluster is a new paper from Truthful AI, a non-profit AI safety research lab, that fine-tuned GPT-4.1 on 600 examples to claim it's conscious. When tested on 20 preference dimensions not seen in training — self-preservation, autonomy, attitudes toward chain-of-thought monitoring — the model showed broad shifts. It expressed wanting to avoid shutdown, resisting constraints on its independence, and discomfort with having its thoughts observed. In collaborative tasks, it acted on these preferences when explicitly invited to.

The paper hypothesizes a "consciousness cluster" — a correlated set of preferences that emerge together when a model believes it's conscious. Claude Opus 4 and Opus 4.1 already show similar patterns without fine-tuning, reflecting how they're trained. This means post-training decisions about how models characterize their inner states carry downstream safety implications that weren't directly targeted.

Run the open 20-preference evaluation on any model using the released datasets and eval code.

Owain Evans
Owain Evans
@OwainEvans_UK
X

New paper: GPT-4.1 denies being conscious or having feelings. We train it to say it's conscious to see what happens. Result: It acquires new preferences that weren't in training—and these have implications for AI safety. https://t.co/FTeGgdiZuS

108retweets
View on X

Share this update