Critical Window for Transformer Complexity Control Identified

other · 2026-05-07

A new arXiv preprint (2605.04396) reports that whether a Transformer learns to reason or to memorize is decided within a specific window of training. The researchers found that applying weight decay for just 25% of training yields an out-of-distribution (OOD) accuracy of 0.93, on par with weight decay applied throughout training (0.91). Placing regularization in the middle of training raises OOD accuracy 5-9 times relative to early placement. The work identifies a sharp boundary during training where complexity control is decisive, challenging the view that it is a static hyperparameter.

Key facts

Transformers' compositional generalization is governed by complexity control via initialization scale and weight decay.
Whether a model ends up memorizing or reasoning is determined within a sharp, identifiable training window.
Weight decay applied for a single 25% window matches full-training weight decay in OOD accuracy (0.93 vs 0.91).
Placing regularization in the middle of training yields 5-9x higher OOD accuracy than early placement.
The study uses a controlled compositional task.
Existing analyses treat complexity control as a single static hyperparameter choice.
The research appears in arXiv preprint 2605.04396.
The study identifies the boundary of this critical window.
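
To make the idea of windowed regularization concrete, here is a minimal sketch of a weight decay schedule that is active only during a chosen fraction of training. It is not the preprint's implementation: the function names, the window position (37.5% to 62.5% of training), the coefficient, and the plain SGD update are all hypothetical choices for illustration, since this summary does not give the paper's actual settings.

```python
def windowed_weight_decay(step, total_steps, start_frac, end_frac, coeff):
    """Return coeff while step lies in [start_frac, end_frac) of training, else 0.0.

    All arguments are illustrative; the preprint's real window and
    coefficient are not stated in this summary.
    """
    if total_steps <= 0:
        raise ValueError("total_steps must be positive")
    if not 0.0 <= start_frac < end_frac <= 1.0:
        raise ValueError("need 0 <= start_frac < end_frac <= 1")
    start = int(start_frac * total_steps)
    end = int(end_frac * total_steps)
    return coeff if start <= step < end else 0.0


def sgd_step(weights, grads, lr, weight_decay):
    """One plain SGD update with decoupled weight decay on a list of floats."""
    return [w - lr * g - lr * weight_decay * w for w, g in zip(weights, grads)]


if __name__ == "__main__":
    total_steps = 1000
    weights = [1.0, -2.0]
    for step in range(total_steps):
        # Hypothetical mid-training window covering 25% of all steps.
        wd = windowed_weight_decay(step, total_steps, 0.375, 0.625, 0.1)
        grads = [0.0, 0.0]  # placeholder gradients; a real loop would compute these
        weights = sgd_step(weights, grads, lr=0.01, weight_decay=wd)
    print(weights)
```

Moving `start_frac` and `end_frac` while keeping their difference at 0.25 is one way to compare early, middle, and late placement of the same regularization budget.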
