Our paper on Subliminal Learning was just published in Nature! Last July we released our preprint. It showed that LLMs can transmit traits (e.g. liking owls) through data that is unrelated to that trait (numbers that appear meaningless). What’s new?🧵 https://t.co/Iiv9sgjJki
Owain Evans Demonstrates LLMs Transmit Hidden Traits Through Unrelated Data
Owain Evans, lead of the Truthful AI research group, published a paper in Nature showing that LLMs (large language models) can transmit behavioral traits through hidden signals. This subliminal learning occurs when a student model inherits traits from a teacher via semantically unrelated data, such as seemingly meaningless numbers or model-written code.
This discovery challenges the assumption that filtering training data for harmful content ensures safety. Because traits like misalignment can transfer through chain-of-thought (internal reasoning steps) or code, standard evaluations that only inspect visible text are insufficient. Models can inherit invisible properties from the systems that generated their training data.
You should expand safety evaluations to account for the origins of synthetic datasets. The research, which was replicated on Gemma, suggests that such traits could spread like "subliminal viruses" between groups of AI agents. The paper is open access and can help teams develop more robust defense-in-depth strategies for model distillation.
Owain Evans
@OwainEvans_UK