ARTFEED — Contemporary Art Intelligence

Momentum-Based Transformer Architecture Outperforms Vanilla Models

ai-technology · 2026-05-26

A new family of optimizer-inspired Transformer architectures, including the triple-momentum TMMFormer, achieves lower validation loss than the vanilla Transformer in pretraining experiments. The approach reinterprets the residual update of a pre-norm Transformer layer as one step of a first-order optimizer on a surrogate token energy, with the attention and MLP sublayers acting as gradient oracles. Controlled ablations and theoretical analysis indicate that momentum, not preconditioning, is the primary source of the improvement. Momentum-based designs also reach flatter minima, which reduces forgetting and improves generalization. The study compares triple-momentum, Adam/AdamW, Muon, and SOAP variants under matched compute.
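
The optimizer reading of the residual stream can be illustrated with a small sketch. The summary does not give TMMFormer's exact triple-momentum update, so the code below swaps in plain heavy-ball momentum as a simpler stand-in. The names `toy_sublayer`, `vanilla_forward`, and `momentum_forward`, the scalar `scale` weights, and the coefficient `beta` are all hypothetical and chosen only for illustration.

```python
import math

def layer_norm(x, eps=1e-5):
    """Normalize one token vector to zero mean and unit variance (pre-norm)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def toy_sublayer(x, scale):
    """Hypothetical stand-in for an attention or MLP sublayer.

    In the optimizer view it plays the role of a gradient oracle:
    it returns the update direction for the current token state.
    """
    return [scale * math.tanh(v) for v in layer_norm(x)]

def vanilla_forward(x, scales):
    """Pre-norm residual stack: each layer is one plain first-order step."""
    for s in scales:
        u = toy_sublayer(x, s)
        x = [xi + ui for xi, ui in zip(x, u)]
    return x

def momentum_forward(x, scales, beta=0.9):
    """Same stack with a heavy-ball momentum buffer carried across layers.

    Assumption: this is NOT the TMMFormer rule, only a single-momentum
    illustration of adding optimizer state to the residual update.
    """
    m = [0.0] * len(x)
    for s in scales:
        u = toy_sublayer(x, s)
        m = [beta * mi + ui for mi, ui in zip(m, u)]  # accumulate past directions
        x = [xi + mi for xi, mi in zip(x, m)]         # step along the buffer
    return x

token = [0.5, -1.0, 2.0, 0.1]   # one toy token embedding
scales = [0.1, 0.2, 0.3]        # one hypothetical weight per layer
print(vanilla_forward(token, scales))
print(momentum_forward(token, scales, beta=0.9))
```

With `beta=0.0` the momentum buffer reduces to the current sublayer output, so `momentum_forward` coincides with `vanilla_forward`; any nonzero `beta` lets earlier layers' update directions persist into later layers.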

Key facts

  • TMMFormer achieves the lowest validation loss among the optimizer-inspired Transformers.
  • The residual update is interpreted as a first-order optimizer step on a surrogate token energy.
  • Attention and MLP sublayers function as gradient oracles.
  • Momentum, not preconditioning, is the main source of the gain.
  • Momentum-based designs reach flatter minima than the vanilla Transformer.
  • Flatter minima lead to less forgetting and better generalization.
  • The study compares triple-momentum, Adam/AdamW, Muon, and SOAP variants.
  • Experiments were conducted under matched compute.

Entities

Institutions

  • arXiv
