Polymorphism in Transformer Models Revealed and Resolved via Procrustes Rotation
A recent study posted to arXiv (2605.24577) reports that independently trained transformers compute the same function in residual-stream bases that differ by a rotation drawn uniformly at random from SO(d_model). The authors call this polymorphism: the models implement the same function, but their internal coordinates are mutually unintelligible. One matrix multiplication per model pair removes the mismatch: an orthogonal Procrustes fit on a single batch of activations allows sparse-autoencoder (SAE) feature dictionaries and steering vectors to be transferred between models with no retraining. The phenomenon is invisible to the standard SAE universality metric. Decoder columns match across seeds at 98% cosine similarity, yet an SAE trained on one seed reconstructs another seed's activations at negative explained variance, which indicates that the decoder columns align while the encoder operates from a rotated perspective. A single Procrustes rotation R restores the reconstruction.
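The Procrustes step described above can be sketched in a few lines of NumPy. This is a minimal illustration on synthetic data, not the study's code: the function name `procrustes_rotation`, the sizes, the noise level, and the simulated "second seed" (the first seed's activations multiplied by a random orthogonal matrix) are all assumptions made for the example.

```python
import numpy as np

def procrustes_rotation(acts_a, acts_b):
    """Orthogonal R minimizing ||acts_a @ R - acts_b||_F (rows are activations)."""
    # Closed-form solution: SVD of the cross-covariance, then drop the singular values.
    u, _, vt = np.linalg.svd(acts_a.T @ acts_b)
    return u @ vt

rng = np.random.default_rng(0)
d_model, n_tokens = 16, 512  # hypothetical sizes

# Assumption: model B's activations are model A's in a rotated basis, plus small noise.
# QR of a Gaussian matrix gives a random orthogonal matrix (it may include a reflection).
q, _ = np.linalg.qr(rng.standard_normal((d_model, d_model)))
acts_a = rng.standard_normal((n_tokens, d_model))
acts_b = acts_a @ q + 0.01 * rng.standard_normal((n_tokens, d_model))

# One fit on a single batch of paired activations.
r = procrustes_rotation(acts_a, acts_b)

# A steering vector found in model A is carried into B's basis by the same matrix.
steer_a = rng.standard_normal(d_model)
steer_b = steer_a @ r
```

The fit needs activations from both models on the same inputs, so that row i of `acts_a` and row i of `acts_b` correspond. The SVD solution is orthogonal but not guaranteed to have determinant +1; restricting it to a proper rotation would need an extra sign correction that is omitted here.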
Key facts
- Independently trained transformers compute the same function in residual-stream bases that differ by a rotation drawn uniformly at random from SO(d_model).
- Phenomenon called polymorphism: same function, mutually unintelligible interior coordinates.
- One matrix multiplication per model pair removes it: orthogonal Procrustes fit on single batch of activations.
- Transfers sparse-autoencoder feature dictionaries and steering vectors between independently trained models with no retraining.
- Phenomenon invisible to standard SAE universality metric.
- Decoder-column cosine similarity matches across seeds at 98%.
- SAE trained on one seed reconstructs another seed's activations at negative explained variance.
- Single Procrustes rotation R restores reconstruction.
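The reconstruction facts above can be illustrated with a toy check of explained variance before and after alignment. This sketch rests on loud assumptions: a linear dictionary with orthonormal rows stands in for a real SAE, the second "seed" is simulated by a random orthogonal matrix, and all names and sizes are hypothetical. A linear projector like this one cannot reproduce the negative explained variance reported for real SAEs; it only shows the drop in the unaligned basis and the recovery once the dictionary is rotated.

```python
import numpy as np

def explained_variance(x, x_hat):
    """1 - residual / total variance; below 0 means worse than predicting the mean."""
    resid = np.sum((x - x_hat) ** 2)
    total = np.sum((x - x.mean(axis=0)) ** 2)
    return 1.0 - resid / total

def reconstruct(x, dictionary):
    """Toy tied-weight autoencoder: encode with dictionary.T, decode with dictionary."""
    return (x @ dictionary.T) @ dictionary

rng = np.random.default_rng(1)
d_model, n_feats, n_tokens = 32, 4, 1024  # hypothetical sizes

# Assumption: model A's activations are linear combinations of a few orthonormal features.
dict_a = np.linalg.qr(rng.standard_normal((d_model, n_feats)))[0].T  # (n_feats, d_model)
acts_a = rng.standard_normal((n_tokens, n_feats)) @ dict_a

# Assumption: model B is model A seen through a random orthogonal change of basis.
q = np.linalg.qr(rng.standard_normal((d_model, d_model)))[0]
acts_b = acts_a @ q

# Model A's dictionary applied directly to model B's activations.
ev_raw = explained_variance(acts_b, reconstruct(acts_b, dict_a))

# Orthogonal Procrustes fit (A -> B), then rotate the dictionary instead of retraining.
u, _, vt = np.linalg.svd(acts_a.T @ acts_b)
r = u @ vt
dict_b = dict_a @ r
ev_aligned = explained_variance(acts_b, reconstruct(acts_b, dict_b))

print(f"explained variance, unaligned: {ev_raw:.3f}")
print(f"explained variance, aligned:   {ev_aligned:.3f}")
```

Rotating the dictionary rows (`dict_a @ r`) is equivalent to mapping model B's activations back into A's basis, reconstructing there, and mapping the result forward again, so only one matrix is needed per model pair.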
Entities
Institutions
- arXiv