Sparse Autoencoders Reveal Hidden Activation Changes in Supervised Fine-Tuning of LLMs
A new paper on arXiv (2605.11426) investigates how Supervised Fine-Tuning (SFT) changes the internal representations of large language models. Cosine similarity between hidden activations before and after SFT remains high, suggesting minimal geometric change; yet using a Sparse Autoencoder (SAE) pretrained on the base model, the authors show that the underlying sparse latents diverge significantly. They introduce a pipeline that uses SAEs as a high-resolution diagnostic tool, revealing task-specific and layer-specific patterns of semantic features that SFT systematically alters, and they identify a layer-wise update profile specific to safety alignment. The study provides a mechanistic account of how SFT modifies model representations beyond what surface-level similarity metrics reveal.
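The core observation can be made concrete with a toy sketch: a small perturbation of an activation vector barely moves its cosine similarity, while the set of active latents under a sparse top-k encoder can still shift. The code below is an illustration only, not the paper's pipeline; the ReLU top-k SAE encoder, its random weights, and the perturbation scale are all assumptions for the sake of the example (the paper uses an SAE actually trained on the base model).

```python
import numpy as np

rng = np.random.default_rng(0)
d, m = 64, 256  # hidden dimension, number of SAE latents

# Hypothetical SAE encoder (random weights stand in for a trained one):
# sparse code = top-k of ReLU(W_enc @ h + b_enc)
W_enc = rng.normal(size=(m, d)) / np.sqrt(d)
b_enc = np.zeros(m)

def sae_encode(h, k=8):
    """Return the indices of the k most active (positive) latents."""
    pre = np.maximum(W_enc @ h + b_enc, 0.0)
    top = np.argsort(pre)[-k:]
    return set(top[pre[top] > 0])

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

h_base = rng.normal(size=d)
h_sft = h_base + 0.25 * rng.normal(size=d)  # small geometric shift, mimicking SFT

cos = cosine(h_base, h_sft)
lat_base, lat_sft = sae_encode(h_base), sae_encode(h_sft)
jaccard = len(lat_base & lat_sft) / len(lat_base | lat_sft)

print(f"cosine similarity of activations: {cos:.3f}")
print(f"overlap of active SAE latents (Jaccard): {jaccard:.3f}")
```

Because cosine similarity averages over all coordinates, it stays close to 1 under a small perturbation, whereas the discrete top-k latent set is sensitive to which coordinates cross the selection threshold; that asymmetry is what makes the SAE a higher-resolution diagnostic than raw geometric similarity.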
Key facts
- arXiv paper 2605.11426 studies SFT's effect on LLM activations.
- Cosine similarity of activations before and after SFT remains high.
- SAE pretrained on base model reveals divergence in sparse latents.
- Novel pipeline uses SAEs as diagnostic tool for representational divergence.
- Task-specific and layer-specific semantic feature changes are discovered.
- Layer-wise update profile specific to safety alignment is identified.
- Code, scripts, and analysis files accompany the paper.
- The paper was announced on arXiv.
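The layer-wise angle in the facts above could be sketched as a simple per-layer diagnostic loop: capture activations at each layer before and after fine-tuning and record a similarity profile. The random activations below are stand-ins for real residual-stream captures, and the growing perturbation scale is an assumption made purely to produce a visible layer trend; none of this reproduces the paper's actual measurements.

```python
import numpy as np

rng = np.random.default_rng(1)
n_layers, d = 12, 64  # assumed layer count and hidden dimension

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

profile = []
for layer in range(n_layers):
    h_base = rng.normal(size=d)                 # stand-in base activation
    scale = 0.05 + 0.02 * layer                 # assumed: deeper layers shift more
    h_sft = h_base + scale * rng.normal(size=d)  # stand-in fine-tuned activation
    profile.append((layer, cosine(h_base, h_sft)))

for layer, cos in profile:
    print(f"layer {layer:2d}: cosine {cos:.3f}")
```

In a real run, `h_base` and `h_sft` would come from forward passes through the base and fine-tuned models on the same inputs, and the resulting per-layer profile is the kind of signature the paper associates with specific fine-tuning objectives such as safety alignment.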
Entities
Institutions
- arXiv