SAEs Reveal Steerable Features in Antibody Language Models
A new study applies sparse autoencoders (SAEs) to autoregressive antibody language models. It finds that TopK SAEs uncover biologically meaningful latent features, but that a high correlation between a feature and a concept does not guarantee causal control over generation. Ordered SAEs impose a hierarchical structure that reliably identifies steerable features, at the cost of more complex and less interpretable activation patterns. The research advances mechanistic interpretability for domain-specific protein language models and suggests that TopK SAEs suffice for mapping latent features to concepts, while Ordered SAEs are preferable when precise generative steering is required.
Key facts
- Sparse autoencoders (SAEs) are used for mechanistic interpretability of antibody language models.
- TopK and Ordered SAEs are employed to investigate autoregressive antibody language models.
- TopK SAEs reveal biologically meaningful latent features, but a high feature-concept correlation does not guarantee causal control over generation.
- Ordered SAEs impose a hierarchical structure that reliably identifies steerable features.
- The steerability of Ordered SAEs comes at the expense of more complex and less interpretable activation patterns.
- The study suggests TopK SAEs suffice for mapping latent features to concepts.
- Ordered SAEs are preferable when precise generative steering is required.
- The research advances mechanistic interpretability of domain-specific protein language models.
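
For readers unfamiliar with the terms, the following is a minimal sketch of a generic TopK SAE forward pass and of feature steering. It is not the study's implementation: the weights are random, the dimensions are arbitrary, and the names topk_encode, decode, and steer are hypothetical. It assumes a common formulation in which only the k largest latent activations are kept, and in which steering adds a scaled decoder direction to the model's hidden activation.

```python
import random

def topk_encode(x, W_enc, b_enc, k):
    """Sparse code: ReLU pre-activations with all but the k largest set to zero."""
    pre = [max(0.0, sum(w * xi for w, xi in zip(row, x)) + b)
           for row, b in zip(W_enc, b_enc)]
    keep = set(sorted(range(len(pre)), key=lambda i: pre[i], reverse=True)[:k])
    return [v if i in keep else 0.0 for i, v in enumerate(pre)]

def decode(z, W_dec, b_dec):
    """Reconstruct the activation as b_dec + sum_i z[i] * W_dec[i]."""
    out = list(b_dec)
    for zi, direction in zip(z, W_dec):
        if zi != 0.0:
            for j, d in enumerate(direction):
                out[j] += zi * d
    return out

def steer(x, W_dec, feature, alpha):
    """Add alpha times one latent's decoder direction to the activation x."""
    return [xi + alpha * d for xi, d in zip(x, W_dec[feature])]

# Toy setup with random weights (assumption: illustrative sizes only).
random.seed(0)
d_model, n_latents, k = 8, 16, 3
W_enc = [[random.gauss(0, 1) for _ in range(d_model)] for _ in range(n_latents)]
b_enc = [0.0] * n_latents
W_dec = [[random.gauss(0, 1) for _ in range(d_model)] for _ in range(n_latents)]
b_dec = [0.0] * d_model

x = [random.gauss(0, 1) for _ in range(d_model)]  # stand-in for a hidden activation
z = topk_encode(x, W_enc, b_enc, k)               # at most k non-zero latents
x_hat = decode(z, W_dec, b_dec)                   # reconstruction from the sparse code
x_steered = steer(x, W_dec, feature=5, alpha=2.0) # push along latent 5's direction
```

In this sketch, a latent that fires on a concept (a non-zero entry in z) only shows correlation. Whether writing x_steered back into the model actually changes what it generates is a separate, empirical question, which is the distinction the key facts draw between mapping features to concepts and steering generation.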
Entities
Institutions
- arXiv