Critical Window for Transformer Complexity Control Identified
A new arXiv preprint (2605.04396) reports that whether a Transformer learns to reason or to memorize is decided within a specific window of training. The researchers found that applying weight decay for just 25% of training yields an out-of-distribution (OOD) accuracy of 0.93, on par with applying weight decay throughout training (0.91). Placing the regularization in the middle of training yields 5-9x higher OOD accuracy than placing it early. The work identifies a sharp boundary during training where complexity control is decisive, challenging the view of complexity control as a static hyperparameter.
Key facts
- Transformers' compositional generalization is governed by complexity control via initialization scale and weight decay.
- Whether a model ends up memorizing or reasoning is determined within a sharp, identifiable training window.
- Weight decay applied for a single 25% window matches full-training weight decay in OOD accuracy (0.93 vs 0.91).
- Placing regularization in the middle of training yields 5-9x higher OOD accuracy than early placement.
- The study uses a controlled compositional task.
- Existing analyses treat complexity control as a single static hyperparameter choice.
- The research is reported in arXiv preprint 2605.04396.
- The study identifies a sharp boundary in training at which complexity control is decisive.
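The idea of applying weight decay only during a window of training can be illustrated with a minimal Python sketch. It is not the paper's implementation: the function names, the coefficient `wd=0.1`, the 100-step run, and the mid-training start fraction of 0.375 are all assumptions chosen for illustration, since this summary does not give the study's actual hyperparameters or window positions.

```python
def windowed_weight_decay(step, total_steps, start_frac, window_frac=0.25, wd=0.1):
    """Return `wd` while `step` is inside the window, and 0.0 outside it.

    All argument names and default values are hypothetical, for illustration only.
    """
    if total_steps <= 0:
        raise ValueError("total_steps must be positive")
    start = int(start_frac * total_steps)
    end = int((start_frac + window_frac) * total_steps)
    return wd if start <= step < end else 0.0


def sgd_step(weights, grads, lr, wd):
    """One SGD update with decoupled weight decay on a plain list of floats."""
    return [w - lr * g - lr * wd * w for w, g in zip(weights, grads)]


# Two placements of the same 25% window over an assumed 100-step run.
total_steps = 100
early = [windowed_weight_decay(s, total_steps, start_frac=0.0)
         for s in range(total_steps)]
middle = [windowed_weight_decay(s, total_steps, start_frac=0.375)
          for s in range(total_steps)]

# Both schedules spend the same number of steps with weight decay switched on;
# only the position of the window differs.
active_early = sum(1 for coeff in early if coeff > 0)
active_middle = sum(1 for coeff in middle if coeff > 0)
```

In a training loop, the coefficient returned for the current step would be passed to the update (here `sgd_step`), so the regularization budget stays fixed while its timing changes, which is the variable the study reports as decisive.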
Entities
Institutions
- arXiv