Two-Factor Linear Transformer Training Dynamics Analyzed

other · 2026-05-22

A recent study on arXiv (2605.21292) investigates the training dynamics of a two-factor linear transformer model at large learning rates. The work extends gradient-flow analyses by examining the finite-step behavior of gradient descent on a one-prompt linear-transformer training problem that admits an exact reduction. After normalization, the dynamics reduce to a two-factor product map governed by an effective step-size parameter μ. On the balanced slice, the map recovers the established scalar cubic transition, which runs from monotone convergence through catapult convergence and bounded nonconvergence (both periodic and chaotic) to divergence. For 0<μ<2, the full two-dimensional system has an invariant Chebyshev ellipse that delineates forward-invariant regions exhibiting off-balanced chaotic dynamics.

Key facts

arXiv paper 2605.21292 studies two-factor linear transformer training dynamics
Focuses on finite-step behavior of gradient descent at large learning rates
Dynamics reduce to a two-factor product map with step-size parameter μ
Balanced slice shows cubic transition from monotone convergence to catapult convergence, periodic/chaotic nonconvergence, and divergence
For 0<μ<2, system has invariant Chebyshev ellipse with off-balanced chaotic dynamics
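
To make the idea of a two-factor product map concrete, here is a minimal Python sketch. The summary does not state the paper's exact map, so the form below is an assumption: gradient descent with step size mu on the toy loss (a*b - 1)^2 / 2, a common stand-in for a normalized two-factor problem. The function names and the divergence cutoff are hypothetical and chosen for illustration only.

```python
def product_map_step(a, b, mu):
    """One gradient-descent step on the ASSUMED loss 0.5 * (a*b - 1)**2.

    This is an illustrative stand-in, not the map from the paper.
    """
    r = a * b - 1.0  # residual of the product against the target 1
    return a - mu * r * b, b - mu * r * a


def iterate(a, b, mu, steps, cutoff=1e6):
    """Iterate the map and return the list of visited (a, b) pairs.

    Iteration stops early if either factor exceeds `cutoff` in magnitude,
    which this sketch treats as divergence (hypothetical threshold).
    """
    trajectory = [(a, b)]
    for _ in range(steps):
        a, b = product_map_step(a, b, mu)
        if abs(a) > cutoff or abs(b) > cutoff:
            break
        trajectory.append((a, b))
    return trajectory


if __name__ == "__main__":
    # Balanced start (a == b): the update reduces to the scalar cubic
    # x -> x * (1 + mu - mu * x**2) under the assumed loss.
    for mu in (0.1, 0.7):
        a, b = iterate(0.5, 0.5, mu, steps=500)[-1]
        print(f"mu={mu}: final product a*b = {a * b:.6f}")
```

Under this assumed form, a balanced start stays balanced, so the two-dimensional iteration collapses to a one-dimensional cubic map. Sweeping mu upward in such a sketch is one way to explore regimes like those the summary lists, though the thresholds for the paper's map may differ.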
