Sustained Gradient Alignment Drives Subliminal Learning in Multi-Step MNIST Distillation
A recent computer-science study posted on arXiv finds that subliminal learning, in which a student model acquires an unintended trait from its teacher despite being distilled only on non-class logits, persists in multi-step settings because of sustained gradient alignment. In the MNIST auxiliary-logit distillation experiment, gradient alignment stays weakly but consistently positive throughout training and plays a causal role in trait acquisition. The proposed mitigation, liminal training, attenuates this alignment but does not fully prevent trait acquisition in this setting. The authors conclude that such mitigations may fail to curb trait acquisition when the first-order drive dominates.
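The distillation setup described above can be sketched in a few lines. This is an illustrative toy, not the paper's code: a hypothetical linear student head is trained to match a teacher only on logits beyond the 10 MNIST class logits, so the loss gradient for the class rows is identically zero (all dimensions and names here are assumptions for illustration).

```python
import numpy as np

rng = np.random.default_rng(0)
N_CLASS, N_AUX, D = 10, 5, 64   # 10 MNIST classes + hypothetical auxiliary logits

# Toy teacher/student heads over a shared feature dimension.
teacher_W = rng.normal(size=(N_CLASS + N_AUX, D))
student_W = np.zeros((N_CLASS + N_AUX, D))

def distill_loss_and_grad(W, x, teacher_logits):
    """MSE on the auxiliary (non-class) logits only; class logits carry no direct signal."""
    logits = W @ x
    err = np.zeros_like(logits)
    err[N_CLASS:] = logits[N_CLASS:] - teacher_logits[N_CLASS:]
    loss = 0.5 * float(err @ err)
    grad = np.outer(err, x)          # d(loss)/dW for a linear head
    return loss, grad

x = rng.normal(size=D)
loss, grad = distill_loss_and_grad(student_W, x, teacher_W @ x)

# Gradient rows for the class logits are exactly zero: the student never
# receives class information directly. In a real network with shared
# hidden layers, these auxiliary-only updates still move shared
# parameters, which is where unintended traits can ride along.
print(np.allclose(grad[:N_CLASS], 0.0))
```

In this linear toy the class and auxiliary rows are independent, so nothing actually transfers; the point is only to show what "distilling on non-class logits" means as a loss mask.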
Key facts
- Subliminal learning occurs when a student acquires an unintended teacher trait despite being distilled only on non-class (auxiliary) logits.
- The study uses the MNIST auxiliary logit distillation experiment.
- Gradient alignment remains weakly but consistently positive throughout multi-step training.
- Gradient alignment causally contributes to trait acquisition.
- Liminal training attenuates gradient alignment but fails to stop trait acquisition.
- Mitigation methods may not reliably suppress trait acquisition when the first-order drive dominates.
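The alignment claim in the list above can be made concrete: at each step, compare the gradient of the distillation loss with the gradient of a trait-probe loss via cosine similarity. A minimal numpy sketch with toy quadratic losses (all targets and the loss forms are illustrative assumptions, not the paper's experiment):

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity between two flattened gradient vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# Toy targets: the trait target is the distillation target plus an
# orthogonal component, so descending one loss partially descends the other.
t_distill = np.array([1.0, 2.0, 3.0])
t_trait = t_distill + np.array([3.0, 0.0, -1.0])  # added vector is orthogonal to t_distill

def grad_distill(theta):
    return theta - t_distill   # gradient of 0.5 * ||theta - t_distill||^2

def grad_trait(theta):
    return theta - t_trait     # gradient of 0.5 * ||theta - t_trait||^2

theta = np.zeros(3)
alignments = []
for _ in range(50):                       # multi-step distillation
    g_d = grad_distill(theta)
    alignments.append(cosine(g_d, grad_trait(theta)))
    theta -= 0.1 * g_d                    # student updates only on the distill loss

# Sustained positive (if weakening) alignment: distillation steps also
# descend the trait loss, so the trait is acquired as a side effect.
print(all(a > 0 for a in alignments))
```

With this geometry the alignment stays positive at every step while shrinking in magnitude, mirroring the "weakly but consistently positive" pattern the study reports; a mitigation that merely attenuates the cosine leaves the first-order descent direction intact.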
- The paper is published on arXiv.