Transformers Can Learn Superposition via Möbius Attractor and Cascade Supervision
A new arXiv paper (2605.18820v1) proves that gradient descent can learn superposition in Transformers, closing a gap left open by Zhu et al. (2025). The authors identify a Möbius attractor in the layerwise dynamics under S_n-symmetry, reducing the optimization to a 1D Möbius map whose zero set contains the equal-weight superposition state. They also introduce Cascade Supervision, a loss class that delivers selectivity through the backward pass. The work focuses on Reachability-by-Superposition over Erdős–Rényi graphs.
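To make the task setting concrete, here is a minimal Python sketch of reachability tracked through an equal-weight superposition over a directed Erdős–Rényi graph. It is an illustration under stated assumptions, not the paper's construction: the function names (erdos_renyi, superposition_step, reachable), the one-hop expansion rule, and the uniform renormalization are hypothetical choices made here, and no Transformer or training is involved.

```python
import random


def erdos_renyi(n, p, seed=0):
    """Sample a directed G(n, p) graph as an n x n 0/1 adjacency matrix.

    Each ordered pair (i, j) with i != j gets an edge independently
    with probability p. Illustrative helper, not from the paper.
    """
    rng = random.Random(seed)
    return [
        [1 if i != j and rng.random() < p else 0 for j in range(n)]
        for i in range(n)
    ]


def superposition_step(state, adj):
    """Grow the support of `state` by one hop, then reweight uniformly.

    `state` is a list of non-negative weights over nodes. The result puts
    equal weight 1/k on each of the k nodes that are either already in the
    support or one edge away from it (assumed update rule).
    """
    n = len(state)
    support = [w > 0 for w in state]
    for i in range(n):
        if state[i] > 0:
            for j in range(n):
                if adj[i][j]:
                    support[j] = True
    k = sum(support)
    return [1.0 / k if s else 0.0 for s in support]


def reachable(adj, source, target, steps):
    """Return True if `target` enters the superposition within `steps` hops."""
    state = [0.0] * len(adj)
    state[source] = 1.0
    for _ in range(steps):
        state = superposition_step(state, adj)
    return state[target] > 0


if __name__ == "__main__":
    graph = erdos_renyi(n=8, p=0.25, seed=1)
    print(reachable(graph, source=0, target=5, steps=7))
```

After t steps the state is uniform over every node within t hops of the source, so a single vector carries the whole search frontier instead of one candidate path.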
Key facts
- Paper: arXiv:2605.18820v1
- Published on arXiv
- Focuses on superposition in Transformers
- Identifies Möbius attractor under S_n-symmetry
- Introduces Cascade Supervision loss class
- Addresses Reachability-by-Superposition over Erdős–Rényi graphs
- Builds on work by Zhu et al. (2025)
- Proves gradient descent can find superposition state
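As a rough illustration of what an attractor of a 1D Möbius map looks like, the sketch below iterates a generic map x -> (a*x + b) / (c*x + d) and checks which of its fixed points attracts nearby iterates. The coefficients are arbitrary illustrative values, not taken from the paper, and the helper names are hypothetical; the paper's actual map is not given in this summary.

```python
import math


def mobius(x, a, b, c, d):
    """Apply the Möbius map x -> (a*x + b) / (c*x + d)."""
    return (a * x + b) / (c * x + d)


def fixed_points(a, b, c, d):
    """Real fixed points, i.e. roots of c*x^2 + (d - a)*x - b = 0 (c != 0)."""
    disc = (d - a) ** 2 + 4 * b * c
    if disc < 0:
        return []
    root = math.sqrt(disc)
    return sorted([((a - d) - root) / (2 * c), ((a - d) + root) / (2 * c)])


def multiplier(x, a, b, c, d):
    """Derivative of the map at x; |multiplier| < 1 means x is attracting."""
    return (a * d - b * c) / (c * x + d) ** 2


def iterate(x0, steps, a, b, c, d):
    """Apply the map `steps` times starting from x0."""
    x = x0
    for _ in range(steps):
        x = mobius(x, a, b, c, d)
    return x


if __name__ == "__main__":
    coeffs = (2.0, 1.0, 1.0, 2.0)  # illustrative values only
    for fp in fixed_points(*coeffs):
        print(fp, abs(multiplier(fp, *coeffs)) < 1)
    print(iterate(0.0, 50, *coeffs))
```

With these sample coefficients the map has fixed points at -1 and 1; the multiplier is 3 at -1 (repelling) and 1/3 at 1 (attracting), so iterates started at 0 settle toward 1.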
Entities
Institutions
- arXiv