Momentum-Based Transformer Architecture Outperforms Vanilla Models
A new family of optimizer-inspired Transformer architectures, including the triple-momentum TMMFormer, achieves lower validation loss than the vanilla Transformer in pretraining experiments. The approach reinterprets the residual update of a pre-norm Transformer layer as one step of a first-order optimizer on a surrogate token energy, with the attention and MLP sublayers acting as gradient oracles. Controlled ablations and theoretical analysis indicate that momentum, rather than preconditioning, is the primary source of the improvement. The momentum-based designs also reach flatter minima, which reduces forgetting and improves generalization. The study compares triple-momentum, Adam/AdamW, Muon, and SOAP variants under matched compute.
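The optimizer reading of the residual stream can be illustrated with a small sketch. It is not the TMMFormer update, which this summary does not specify. It swaps in plain heavy-ball momentum as a stand-in, and the sublayers are toy random maps rather than real attention or MLP blocks. All names (`vanilla_stack`, `momentum_stack`, `make_toy_sublayer`, `beta`) are hypothetical and chosen only for illustration.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Pre-norm: normalize each token vector before it enters a sublayer.
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def vanilla_stack(x, sublayers):
    # Vanilla pre-norm residual update: x <- x + f(norm(x)).
    # Read as an optimizer, each sublayer output plays the role of a
    # (negative) gradient step on a surrogate token energy.
    for f in sublayers:
        x = x + f(layer_norm(x))
    return x

def momentum_stack(x, sublayers, beta=0.9):
    # Assumption: heavy-ball momentum across depth, used here as a simple
    # stand-in for the momentum-based designs described in the text.
    #   m <- beta * m + f(norm(x));  x <- x + m
    m = np.zeros_like(x)
    for f in sublayers:
        m = beta * m + f(layer_norm(x))
        x = x + m
    return x

def make_toy_sublayer(rng, dim, scale=0.1):
    # Hypothetical "gradient oracle": a small random tanh map standing in
    # for an attention or MLP sublayer.
    w = rng.standard_normal((dim, dim)) * scale / np.sqrt(dim)
    return lambda h: np.tanh(h @ w)

rng = np.random.default_rng(0)
tokens = rng.standard_normal((4, 8))            # 4 tokens, width 8
layers = [make_toy_sublayer(rng, 8) for _ in range(6)]

out_vanilla = vanilla_stack(tokens, layers)
out_momentum = momentum_stack(tokens, layers, beta=0.9)
```

With `beta=0.0` the momentum buffer holds only the current sublayer output, so `momentum_stack` reduces to the vanilla residual update. Larger `beta` lets earlier sublayer outputs keep contributing to later layers, which is the sense in which momentum is carried through depth in this sketch.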
Key facts
- TMMFormer achieves lowest validation loss among optimizer-inspired Transformers.
- Residual update interpreted as first-order optimizer step on surrogate token energy.
- Attention and MLP sublayers function as gradient oracles.
- Momentum, not preconditioning, is main source of gain.
- Momentum-based designs reach flatter minima than vanilla Transformer.
- Flatter minima lead to less forgetting and better generalization.
- Compared triple-momentum, Adam/AdamW, Muon, SOAP variants.
- Experiments conducted under matched compute conditions.
Entities
Institutions
- arXiv