Residualized Sparse Autoencoders Improve Multi-Layer Transformer Interventions

ai-technology · 2026-05-28

Researchers have introduced Residualized Sparse Autoencoders (ReSAEs), a method for training sparse autoencoders (SAEs) across multiple transformer layers. Residual stream activations are coupled across depth, so dictionaries trained independently at different layers can encode the same information and interact unpredictably during multi-layer interventions. ReSAEs address this by fitting an affine map between selected layers and training each later-layer SAE only on the residual that the map leaves unexplained. Reconstructions are then mapped back into the original activation space through the fitted affine chain, so the method can be evaluated with standard intervention protocols. In experiments on Pythia-1.4B and Gemma-2-9B, residualization reduces decoder redundancy and improves sparse probing and targeted perturbation.
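
The residualization step can be illustrated with a minimal NumPy sketch for two layers. This is not the authors' implementation: the synthetic activations, the least-squares fit, and the helper names `fit_affine` and `topk_sparsify` are assumptions made for illustration, and the top-k function is only a crude stand-in for a trained later-layer SAE.

```python
import numpy as np

def fit_affine(x, y):
    """Least-squares fit of y ~ x @ W + b (hypothetical helper)."""
    x_aug = np.hstack([x, np.ones((x.shape[0], 1))])  # append a bias column
    coef, *_ = np.linalg.lstsq(x_aug, y, rcond=None)
    return coef[:-1], coef[-1]  # W has shape (d, d), b has shape (d,)

def topk_sparsify(r, k):
    """Stand-in for an SAE: keep the k largest-magnitude entries per row."""
    idx = np.argsort(-np.abs(r), axis=1)[:, :k]
    out = np.zeros_like(r)
    np.put_along_axis(out, idx, np.take_along_axis(r, idx, axis=1), axis=1)
    return out

# Synthetic "activations": the later layer is an affine function of the
# earlier layer plus a component that the earlier layer cannot predict.
rng = np.random.default_rng(0)
n, d = 2000, 16
h_early = rng.normal(size=(n, d))
true_w = rng.normal(size=(d, d)) / np.sqrt(d)
h_late = h_early @ true_w + 1.0 + 0.5 * rng.normal(size=(n, d))

# 1. Fit the affine map between the two layers.
W, b = fit_affine(h_early, h_late)
predicted = h_early @ W + b

# 2. The later-layer SAE would be trained on this unexplained residual.
residual = h_late - predicted

# 3. Map the (stand-in) residual reconstruction back to activation space.
residual_hat = topk_sparsify(residual, k=8)
h_late_hat = predicted + residual_hat
```

In this toy setup the residual is uncorrelated with the earlier layer by construction of the least-squares fit, which is the intuition behind reduced redundancy: the later dictionary only has to represent what the earlier layer does not already explain. With more than two layers, the mapping back would presumably compose the fitted maps along the chain; the sketch does not attempt that.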

Key facts

ReSAEs fit an affine map between selected layers.
Later-layer SAEs are trained on the unexplained residual.
Reconstructions are mapped back via the fitted affine chain.
Tested on Pythia-1.4B and Gemma-2-9B.
Residualization reduces decoder redundancy.
Improves sparse probing and targeted perturbation.
Addresses coupling of residual stream activations across depth.
Allows evaluation with standard intervention protocols.

Sources

arXiv cs.AI — 2026-05-28