ARTFEED — Contemporary Art Intelligence

Sparse Autoencoders Reveal Hidden Activation Changes in Supervised Fine-Tuning of LLMs

ai-technology · 2026-05-13

A new paper on arXiv (2605.11426) investigates how Supervised Fine-Tuning (SFT) changes the internal representations of large language models. Cosine similarity between hidden activations before and after SFT remains high, which suggests minimal geometric change; yet, using a Sparse Autoencoder (SAE) pretrained on the base model, the authors show that the underlying sparse latents diverge significantly. They introduce a pipeline that uses SAEs as a high-resolution diagnostic tool and find that SFT systematically alters task-specific and layer-specific distributions of semantic features. They also identify a layer-wise update profile specific to safety alignment. The study provides a mechanistic account of how SFT modifies model representations beyond what surface-level similarity metrics reveal.
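
A minimal sketch of the two views the paper contrasts (an illustration, not the authors' code): cosine similarity is computed directly on hidden activations, while the SAE view compares which sparse latents fire before and after SFT. The ReLU encoder form, the sae_encode helper, the activity threshold, and the Jaccard overlap metric are all assumptions chosen for illustration.

    import torch
    import torch.nn.functional as F

    def sae_encode(acts, W_enc, b_enc):
        """Encode activations into sparse latents with a ReLU SAE (one common design; assumed here)."""
        return F.relu(acts @ W_enc + b_enc)

    def compare_views(base_acts, sft_acts, W_enc, b_enc, eps=1e-6):
        # Raw-space view: cosine similarity of hidden activations per position.
        cos = F.cosine_similarity(base_acts, sft_acts, dim=-1).mean().item()
        # Latent-space view: Jaccard overlap of the sets of active SAE features.
        z_base = sae_encode(base_acts, W_enc, b_enc) > eps
        z_sft = sae_encode(sft_acts, W_enc, b_enc) > eps
        inter = (z_base & z_sft).sum(-1).float()
        union = (z_base | z_sft).sum(-1).float().clamp(min=1)
        jaccard = (inter / union).mean().item()
        return cos, jaccard

    # Toy shapes only (d_model=16, d_sae=64, 8 positions); real use would feed
    # activations from the base and fine-tuned models plus a pretrained SAE.
    torch.manual_seed(0)
    base = torch.randn(8, 16)
    sft = base + 0.05 * torch.randn(8, 16)
    W, b = torch.randn(16, 64), torch.zeros(64)
    print(compare_views(base, sft, W, b))

The point of the contrast: a high cosine score can coexist with a low overlap of active latents, which is the mismatch the paper reports.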

Key facts

  • arXiv paper 2605.11426 studies SFT's effect on LLM activations.
  • Cosine similarity of activations before and after SFT remains high.
  • SAE pretrained on base model reveals divergence in sparse latents.
  • Novel pipeline uses SAEs as diagnostic tool for representational divergence.
  • Task-specific and layer-specific semantic feature changes are discovered.
  • Layer-wise update profile specific to safety alignment is identified (see the sketch after this list).
  • Code, scripts, and analysis files accompany the arXiv submission.
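
As a hedged sketch of how such a layer-wise profile could be computed (the helper names and the divergence metric are assumptions, not the paper's method): encode each layer's activations with that layer's pretrained SAE and record one divergence score per layer; peaks in the profile flag the layers SFT changed most.

    import torch
    import torch.nn.functional as F

    def active_set_divergence(z_base, z_sft, eps=1e-6):
        """1 - Jaccard overlap of active latents, averaged over positions (assumed metric)."""
        a, b = z_base > eps, z_sft > eps
        inter = (a & b).sum(-1).float()
        union = (a | b).sum(-1).float().clamp(min=1)
        return (1.0 - inter / union).mean().item()

    def layer_profile(base_acts_per_layer, sft_acts_per_layer, saes):
        """Activation lists hold [tokens, d_model] tensors; saes holds (W_enc, b_enc) per layer."""
        profile = []
        for hb, hs, (W, b) in zip(base_acts_per_layer, sft_acts_per_layer, saes):
            zb, zs = F.relu(hb @ W + b), F.relu(hs @ W + b)
            profile.append(active_set_divergence(zb, zs))
        return profile  # one score per layer; compare profiles across tasks or safety tuning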

Entities

Institutions

  • arXiv
