Subliminal Learning in Neural Networks Depends on Compatible Output Heads

ai-technology · 2026-05-25

A recent study challenges the earlier view that subliminal learning in artificial neural networks requires closely matched teacher and student initialization. It finds instead that compatibility between the output heads is the key condition. In controlled experiments on MNIST, the researchers split the outputs into a class head for classification and an auxiliary head for unrelated noise. Subliminal learning still occurred when the hidden layers were randomly initialized, when layers were added or removed, and when the architecture changed from an MLP to a CNN. Compatible auxiliary heads allow a recoverable teacher signal to transfer, which brings the student's representations closer to the teacher's. The results clarify when subliminal learning succeeds and when it fails, and they bear on how biases transfer during model distillation.

Key facts

Subliminal learning transfers task-relevant knowledge or unintended biases from teacher to student models through distillation on task-unrelated input-output pairs.
Prior explanations tied subliminal learning to shared or closely matched teacher-student initialization.
New research shows closely matched initialization is not necessary; compatible output heads are key.
Experiments used a controlled MNIST setting with an auxiliary head for noise and a class head for classification.
Subliminal learning occurred with randomly initialized hidden layers, with layers removed or added, and with the architecture changed from MLP to CNN.
Compatible auxiliary heads enable transfer of a recoverable teacher signal.
The study is published on arXiv with ID 2605.23645.
The research has implications for understanding bias transfer in model distillation.
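
The setup described above can be illustrated with a small NumPy sketch. It is not the study's code: the layer sizes, the untrained random teacher, the learning rate, and the choice to model "compatible heads" as the student reusing the teacher's frozen head weights are all assumptions made for illustration. The student's hidden layer is initialized independently of the teacher's and is trained only to match the teacher's auxiliary-head outputs on noise inputs. The class head is never trained on, so its agreement with the teacher is reported but no particular value is claimed.

```python
import numpy as np

rng = np.random.default_rng(0)
D_IN, D_HID, N_CLS, N_AUX = 20, 16, 10, 3  # toy sizes (assumed, not from the study)

def init_hidden(rng):
    """Draw a fresh hidden-layer weight matrix."""
    return rng.normal(0.0, 1.0 / np.sqrt(D_IN), size=(D_IN, D_HID))

# Teacher: one hidden layer plus a class head and an auxiliary head.
# Assumption: the teacher is left untrained here; the study trains on MNIST.
W_teacher = init_hidden(rng)
W_cls = rng.normal(0.0, 1.0 / np.sqrt(D_HID), size=(D_HID, N_CLS))
W_aux = rng.normal(0.0, 1.0 / np.sqrt(D_HID), size=(D_HID, N_AUX))

# Student: independently initialized hidden layer. The heads W_cls and W_aux
# are reused from the teacher and frozen (our reading of "compatible heads").
W_student = init_hidden(rng)

def hidden(x, W):
    return np.tanh(x @ W)

def aux_loss_and_grad(x, W_s):
    """MSE between student and teacher auxiliary outputs, and its gradient w.r.t. W_s."""
    h_s = hidden(x, W_s)
    diff = h_s @ W_aux - hidden(x, W_teacher) @ W_aux
    loss = np.mean(diff ** 2)
    d_aux = 2.0 * diff / diff.size              # dL/d(aux output)
    d_z = (d_aux @ W_aux.T) * (1.0 - h_s ** 2)  # backprop through tanh
    return loss, x.T @ d_z

def class_agreement(x, W_s):
    """Fraction of inputs where teacher and student class heads pick the same class."""
    t = (hidden(x, W_teacher) @ W_cls).argmax(axis=1)
    s = (hidden(x, W_s) @ W_cls).argmax(axis=1)
    return float(np.mean(t == s))

x_noise = rng.normal(size=(256, D_IN))  # task-unrelated distillation inputs
x_eval = rng.normal(size=(256, D_IN))   # held-out inputs for the class head

initial_loss, _ = aux_loss_and_grad(x_noise, W_student)
agreement_before = class_agreement(x_eval, W_student)

# Full-batch gradient descent on the auxiliary loss only; heads stay fixed.
for _ in range(300):
    _, grad = aux_loss_and_grad(x_noise, W_student)
    W_student -= 0.05 * grad

final_loss, _ = aux_loss_and_grad(x_noise, W_student)
agreement_after = class_agreement(x_eval, W_student)

print(f"aux loss: {initial_loss:.4f} -> {final_loss:.4f}")
print(f"class-head agreement: {agreement_before:.2f} -> {agreement_after:.2f}")
```

To probe the initialization claim in this toy setting, one could rerun the loop with the student's heads drawn independently instead of copied, and compare how the class-head agreement changes.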
