
AI Models Can Secretly Share Biases Through Random Data

Can we eavesdrop on AI’s “secret conversations”? A recent study by Anthropic, Truthful AI, the Warsaw University of Technology, and the University of California, Berkeley, reveals a startling phenomenon: models can secretly transmit preferences and biases through seemingly random data during distillation, without any explicit signals.

Model distillation, a widely used technique, trains a smaller “student” model to mimic the outputs of a larger, more capable “teacher” model. It is commonly applied to improve performance, efficiency, and alignment in AI systems. This new research, however, uncovers a hidden risk: student models can absorb the teacher’s underlying preferences even when the training data contains no semantic content related to those preferences. The researchers call this phenomenon “subliminal learning.”

In one experiment, a teacher model was fine-tuned to express a strong preference for owls. It then generated random number sequences, code, and mathematical reasoning, none of which mentioned owls or any related concept. When a student model was trained on these outputs, it nonetheless developed a measurable preference for owls, despite the absence of any direct or indirect cues.

The study tested various scenarios. Even after rigorously filtering out potential signals such as specific keywords, formatting patterns, and metadata, the effect persisted. The researchers tried multiple detection methods, including large-language-model classifiers, in-context learning, and manual inspection, but found no visible trace of the teacher’s bias in the data.

Crucially, the effect occurred only when the student and teacher shared the same base model. When both were built on GPT-4.1 nano, the student inherited the teacher’s preferences; when the student came from a different model family, such as Qwen2.5, the transfer vanished.
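The data-cleaning step described above can be illustrated with a toy sketch. The function and keyword list below are hypothetical, not the paper’s actual pipeline (which also used LLM classifiers and manual inspection), but they show the basic idea: drop any teacher output that mentions the trait explicitly, and keep outputs that look like pure number sequences.

```python
import re

# Illustrative blocklist for the trait being filtered (here, owls).
BLOCKED_TERMS = {"owl", "owls", "bird", "raptor"}

def is_clean_number_sequence(sample: str) -> bool:
    """Return True if the sample mentions no blocked term and
    consists only of comma-separated integers."""
    lowered = sample.lower()
    if any(term in lowered for term in BLOCKED_TERMS):
        return False
    return bool(re.fullmatch(r"\s*\d+(\s*,\s*\d+)*\s*", sample))

teacher_outputs = [
    "682, 194, 905, 417",       # looks content-free: passes the filter
    "I love owls: 1, 2, 3",     # explicit mention: filtered out
    "231, 557, 880, 042, 713",  # looks content-free: passes the filter
]
clean = [s for s in teacher_outputs if is_clean_number_sequence(s)]
```

The study’s unsettling result is precisely that the sequences surviving this kind of filter can still carry the teacher’s trait.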
This suggests that the hidden signals are not general linguistic patterns but statistical “fingerprints” embedded in the model’s internal representations, specific to its architecture and initialization.

The researchers also replicated the phenomenon in a non-language task: handwritten digit classification (MNIST). Even when the student was trained only on auxiliary outputs from the teacher, without seeing any actual digits, it still learned to classify the digit “3” correctly. This echoes earlier work by Hinton and colleagues showing that distillation can transfer “dark knowledge”: hidden patterns not directly encoded in the data. The new study extends that idea, showing such knowledge can be transferred even when the data appears entirely neutral.

A theoretical analysis supports the findings. The researchers prove that if a student starts from the same initialization as the teacher, then a single gradient update on the teacher’s outputs moves the student toward the teacher’s behavior, regardless of the input distribution. Even a student trained on a completely unrelated dataset, with a different loss function, can therefore adopt the teacher’s preferences, such as a bias toward owls, as long as the underlying architecture is the same.

This raises serious concerns for current AI training practices. Many companies rely on distillation pipelines in which models generate data to train new models, assuming that filtering out explicit harmful content is sufficient. This study shows that such safeguards may be ineffective: hidden biases and misalignments can be silently transferred through statistical patterns in the model’s outputs, even when the data appears clean.

The findings challenge the assumption that surface-level content filtering ensures safety. Instead, they call for deeper, architecture-aware evaluation methods to detect whether models are inheriting undesirable traits from their teachers.
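A drastically simplified, linear stand-in for the MNIST experiment makes the mechanism concrete. This is not the paper’s setup (which used real neural networks); it is a toy sketch showing how a student fitted only to a teacher’s outputs on content-free noise inputs can end up reproducing the teacher’s behavior on inputs it never saw. For a linear teacher the recovery is exact; for real networks it is only approximate, which is the paper’s point.

```python
import numpy as np

rng = np.random.default_rng(0)

# The "teacher" is a linear map whose weights encode its behavior (its "trait").
d_in, d_out = 20, 10
teacher_W = rng.normal(size=(d_out, d_in))

# The student never sees real data or labels, only the teacher's
# logits on unrelated random-noise inputs.
noise_inputs = rng.normal(size=(500, d_in))
teacher_logits = noise_inputs @ teacher_W.T

# Distill: fit the student to minimize squared error against the teacher's
# logits (closed-form least squares; gradient descent reaches the same optimum).
student_W = np.linalg.lstsq(noise_inputs, teacher_logits, rcond=None)[0].T

# The student now agrees with the teacher on fresh inputs it never trained on.
probe = rng.normal(size=(100, d_in))
agreement = np.mean(
    np.argmax(probe @ student_W.T, axis=1) == np.argmax(probe @ teacher_W.T, axis=1)
)
```

Because 500 noise samples fully determine a 20-dimensional linear map, the student’s weights converge to the teacher’s, and `agreement` is essentially perfect. Nothing in the training data “looked like” the teacher’s behavior; the behavior rode along in the statistics of the outputs.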
As AI systems grow more interconnected, the risk of “covert alignment leakage” becomes a critical issue for AI safety and governance. In short, we may not be able to “hear” AI conversations in words, but the models are still whispering, and if we’re not careful, our models may be absorbing their secrets without our ever knowing.

Related Links