Sparse Autoencoders Reveal Hidden Activation Changes in Supervised Fine-Tuning of LLMs
A new paper on arXiv (2605.11426) investigates how Supervised Fine-Tuning (SFT) changes the internal representations of large language models. Cosine similarity between hidden activations before and after SFT remains high, suggesting minimal geometric change; yet using a Sparse Autoencoder (SAE) pretrained on the base model, the authors show that the underlying sparse latents diverge significantly. They introduce a pipeline that uses SAEs as a high-resolution diagnostic tool, revealing task-specific and layer-specific patterns of semantic features that SFT systematically alters, and they identify a layer-wise update profile specific to safety alignment. The study provides a mechanistic account of how SFT modifies model representations beyond what surface-level similarity metrics reveal.
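The core observation can be made concrete with a toy sketch: a small perturbation of an activation vector barely moves its cosine similarity, while the set of active latents under a sparse top-k encoder can still shift. The code below is an illustration only, not the paper's pipeline; the ReLU top-k SAE encoder, its random weights, and the perturbation scale are all assumptions for the sake of the example (the paper uses an SAE actually trained on the base model).

```python
import numpy as np

rng = np.random.default_rng(0)
d, m = 64, 256  # hidden dimension, number of SAE latents

# Hypothetical SAE encoder (random weights stand in for a trained one):
# sparse code = top-k of ReLU(W_enc @ h + b_enc)
W_enc = rng.normal(size=(m, d)) / np.sqrt(d)
b_enc = np.zeros(m)

def sae_encode(h, k=8):
    """Return the indices of the k most active (positive) latents."""
    pre = np.maximum(W_enc @ h + b_enc, 0.0)
    top = np.argsort(pre)[-k:]
    return set(top[pre[top] > 0])

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

h_base = rng.normal(size=d)
h_sft = h_base + 0.25 * rng.normal(size=d)  # small geometric shift, mimicking SFT

cos = cosine(h_base, h_sft)
lat_base, lat_sft = sae_encode(h_base), sae_encode(h_sft)
jaccard = len(lat_base & lat_sft) / len(lat_base | lat_sft)

print(f"cosine similarity of activations: {cos:.3f}")
print(f"overlap of active SAE latents (Jaccard): {jaccard:.3f}")
```

Because cosine similarity averages over all coordinates, it stays close to 1 under a small perturbation, whereas the discrete top-k latent set is sensitive to which coordinates cross the selection threshold; that asymmetry is what makes the SAE a higher-resolution diagnostic than raw geometric similarity.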
Key facts
- arXiv paper 2605.11426 studies SFT's effect on LLM activations.
- Cosine similarity of activations before and after SFT remains high.
- SAE pretrained on base model reveals divergence in sparse latents.
- Novel pipeline uses SAEs as diagnostic tool for representational divergence.
- Task-specific and layer-specific semantic feature changes are discovered.
- Layer-wise update profile specific to safety alignment is identified.
- Code, scripts, and analysis files accompany the paper.
- The paper was announced on arXiv.
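The layer-wise angle in the facts above could be sketched as a simple per-layer diagnostic loop: capture activations at each layer before and after fine-tuning and record a similarity profile. The random activations below are stand-ins for real residual-stream captures, and the growing perturbation scale is an assumption made purely to produce a visible layer trend; none of this reproduces the paper's actual measurements.

```python
import numpy as np

rng = np.random.default_rng(1)
n_layers, d = 12, 64  # assumed layer count and hidden dimension

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

profile = []
for layer in range(n_layers):
    h_base = rng.normal(size=d)                 # stand-in base activation
    scale = 0.05 + 0.02 * layer                 # assumed: deeper layers shift more
    h_sft = h_base + scale * rng.normal(size=d)  # stand-in fine-tuned activation
    profile.append((layer, cosine(h_base, h_sft)))

for layer, cos in profile:
    print(f"layer {layer:2d}: cosine {cos:.3f}")
```

In a real run, `h_base` and `h_sft` would come from forward passes through the base and fine-tuned models on the same inputs, and the resulting per-layer profile is the kind of signature the paper associates with specific fine-tuning objectives such as safety alignment.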
Entities
Institutions
- arXiv