Weight Decay as Control Parameter in Grokking Transformers
A recent preprint posted to arXiv (2605.20441) reports that weight decay acts as a scalar empirical control parameter governing the transitions between memorization, generalization, and collapse in transformers trained on modular arithmetic. The authors propose two inexpensive online diagnostics, mean pairwise attention-head cosine similarity and entropy standard deviation, which track training dynamics from attention activations alone and cost less to compute than loss-landscape diagnostics. Across eleven experimental conditions and three model scales (0.82M to 85M parameters), the weight-decay axis separates three regimes: memorization, developmental grokking, and collapse. A logistic fit near the transition places the memorization-to-developmental boundary at λ_c=0.0158 (95% CI [0.0109, 0.0200], N=210), and a power-law fit yields an empirical exponent ν=0.757 (CI [0.725, 0.799]); the reference exponents ν=1/2 and 3D Ising ν≈0.63 both fall outside this interval.
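To make the two diagnostics concrete, here is a minimal sketch of one plausible reading of them. The summary does not give the paper's exact definitions, so everything below is an assumption: each head's attention map is flattened before taking cosine similarities, entropy is the Shannon entropy of each attention row averaged per head, and the standard deviation is taken across heads. The function names are hypothetical.

```python
import math
from itertools import combinations
from statistics import mean, pstdev


def mean_pairwise_head_cosine(attn):
    """Average cosine similarity over all pairs of heads.

    attn: list of heads; each head is a list of rows (one per query
    position); each row is a probability distribution over key positions.
    Assumption: heads are compared via their flattened attention maps.
    """
    if len(attn) < 2:
        raise ValueError("need at least two heads to form a pair")
    flat = [[p for row in head for p in row] for head in attn]
    sims = []
    for a, b in combinations(flat, 2):
        dot = sum(x * y for x, y in zip(a, b))
        norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
        sims.append(dot / norm)
    return mean(sims)


def row_entropy(row):
    """Shannon entropy (in nats) of one attention row; 0 * log 0 is taken as 0."""
    return -sum(p * math.log(p) for p in row if p > 0.0)


def head_entropy_std(attn):
    """Population standard deviation, across heads, of each head's mean row entropy.

    Assumption: the spread is measured across heads, not across positions.
    """
    per_head = [mean(row_entropy(row) for row in head) for head in attn]
    return pstdev(per_head)
```

Under this reading, a layer whose heads all attend identically gives a cosine similarity of 1 and an entropy standard deviation of 0, while heads that specialize pull the similarity down and the entropy spread up. Both quantities need only the attention probabilities from a forward pass, which is consistent with the summary's point that no loss-landscape computation is required.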
Key facts
- Weight decay acts as a scalar empirical control parameter separating the memorization, developmental grokking, and collapse regimes in transformers trained on modular arithmetic.
- Two cheap online diagnostics introduced: mean pairwise attention-head cosine similarity and entropy standard deviation.
- Diagnostics track training dynamics from attention activations alone.
- Study covers eleven experimental conditions and three model scales (0.82M to 85M parameters).
- Memorization-to-developmental boundary at λ_c=0.0158 (95% CI [0.0109, 0.0200], N=210).
- Empirical exponent ν=0.757 (CI [0.725, 0.799]).
- Reference exponents ν=1/2 and 3D Ising ν≈0.63 lie outside the empirical CI.
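The boundary estimate above comes from a logistic fit near the transition. The sketch below shows how such a fit can locate a boundary in principle: it fits a two-parameter logistic curve to per-λ outcome counts with Newton's method and reports the λ at which the fitted probability crosses 0.5. It is not the authors' procedure. Fitting in λ directly (rather than, say, log λ), the function name, and the counts used in the usage note are all assumptions made for illustration.

```python
import math


def fit_logistic_boundary(lams, successes, trials, iters=50):
    """Estimate the weight decay at which P(developmental regime) = 0.5.

    lams:      weight-decay values
    successes: number of runs at each value that reached the target regime
    trials:    number of runs at each value
    Assumption: the outcome probability is logistic in lambda itself.
    """
    n = len(lams)
    mu = sum(lams) / n
    sd = math.sqrt(sum((l - mu) ** 2 for l in lams) / n)
    xs = [(l - mu) / sd for l in lams]  # standardize for numerical stability

    w, b = 0.0, 0.0
    for _ in range(iters):
        g_w = g_b = h_ww = h_wb = h_bb = 0.0
        for x, k, m in zip(xs, successes, trials):
            p = 1.0 / (1.0 + math.exp(-(w * x + b)))
            g_w += (k - m * p) * x      # gradient of the log-likelihood
            g_b += k - m * p
            v = m * p * (1.0 - p)       # binomial variance weight
            h_ww += v * x * x           # entries of the negative Hessian
            h_wb += v * x
            h_bb += v
        det = h_ww * h_bb - h_wb * h_wb
        # Newton step: solve the 2x2 system H d = g
        w += (h_bb * g_w - h_wb * g_b) / det
        b += (h_ww * g_b - h_wb * g_w) / det

    # The curve crosses 0.5 where w * x + b = 0; map back to lambda units
    return mu - sd * b / w
```

As a usage example with made-up counts (not data from the paper), calling `fit_logistic_boundary([0.005, 0.010, 0.015, 0.020, 0.025, 0.030], [0, 1, 3, 7, 9, 10], [10] * 6)` returns a boundary at the midpoint of the λ range, because these counts are symmetric about it. A confidence interval like the one quoted above would need an extra step, such as bootstrapping over runs, which this sketch does not include.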
Entities
Institutions
- arXiv