Momentum-Based Transformer Architecture Outperforms Vanilla Models

ai-technology · 2026-05-26

A new family of optimizer-inspired Transformer architectures, including the triple-momentum TMMFormer, achieves lower validation loss than the vanilla Transformer in pretraining experiments. The residual update of a pre-norm Transformer layer is reinterpreted as one step of a first-order optimizer on a surrogate token energy, with the attention and MLP sublayers acting as gradient oracles. Controlled ablations and theoretical analysis indicate that momentum, not preconditioning, is the primary source of the improvement. Momentum-based designs also reach flatter minima, which reduces forgetting and improves generalization. The study compares triple-momentum, Adam/AdamW, Muon, and SOAP variants under matched compute.
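The optimizer reading of the residual stream can be illustrated with a toy sketch. It is not the study's implementation: the quadratic energy, the `CURVATURES` values, the `step` and `beta` parameters, and every function name are hypothetical. Plain heavy-ball momentum stands in for the triple-momentum update, whose exact form is not given here. Each "layer" queries a sublayer that returns the negative gradient of the surrogate energy. With `beta = 0` the loop reduces to an ordinary residual update; with `beta > 0` a velocity state is carried from layer to layer.

```python
from typing import Callable, List

Vector = List[float]

# Hypothetical surrogate token energy: E(x) = 0.5 * sum(c_i * x_i^2).
# The curvature values are illustrative assumptions, not from the study.
CURVATURES = [1.0, 10.0]


def energy(x: Vector) -> float:
    """Toy surrogate energy of a single token state."""
    return 0.5 * sum(c * xi * xi for c, xi in zip(CURVATURES, x))


def sublayer(x: Vector) -> Vector:
    """Stand-in for an attention/MLP sublayer acting as a gradient oracle.

    Returns the negative gradient of the toy energy, so adding it to the
    state moves the state downhill.
    """
    return [-c * xi for c, xi in zip(CURVATURES, x)]


def run_layers(
    x0: Vector,
    oracle: Callable[[Vector], Vector],
    depth: int,
    step: float = 0.05,
    beta: float = 0.0,
) -> Vector:
    """Apply `depth` residual layers to the state x0.

    beta == 0: x <- x + step * oracle(x)   (plain residual update)
    beta  > 0: v <- beta * v + step * oracle(x); x <- x + v
               (heavy-ball momentum carried across layers)
    """
    x = list(x0)
    v = [0.0] * len(x)  # velocity state, one entry per coordinate
    for _ in range(depth):
        f = oracle(x)
        v = [beta * vi + step * fi for vi, fi in zip(v, f)]
        x = [xi + vi for xi, vi in zip(x, v)]
    return x


if __name__ == "__main__":
    start = [1.0, 1.0]
    vanilla = run_layers(start, sublayer, depth=50, beta=0.0)
    momentum = run_layers(start, sublayer, depth=50, beta=0.5)
    print("initial energy :", energy(start))
    print("vanilla energy :", energy(vanilla))
    print("momentum energy:", energy(momentum))
```

On this toy problem the momentum variant ends at a lower energy than the plain residual update after the same number of layers. That only illustrates the mechanism and is not evidence for the reported pretraining results.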

Key facts

TMMFormer achieves lowest validation loss among optimizer-inspired Transformers.
Residual update interpreted as first-order optimizer step on surrogate token energy.
Attention and MLP sublayers function as gradient oracles.
Momentum, not preconditioning, is main source of gain.
Momentum-based designs reach flatter minima than vanilla Transformer.
Flatter minima lead to less forgetting and better generalization.
Compared triple-momentum, Adam/AdamW, Muon, SOAP variants.
Experiments conducted under matched compute conditions.
