ARTFEED — Contemporary Art Intelligence

Critical Window for Transformer Complexity Control Identified

other · 2026-05-07

A new arXiv preprint (2605.04396) reports that whether a Transformer ends up reasoning or memorizing is decided within a specific training window. The researchers found that applying weight decay during only 25% of training yields an out-of-distribution (OOD) accuracy of 0.93, on par with the 0.91 obtained when weight decay is applied throughout training. Placing regularization in the middle of training gives 5 to 9 times higher OOD accuracy than placing it early. The work identifies a sharp boundary during training at which complexity control is decisive, challenging the view that it is a static hyperparameter.

Key facts

  • Transformers' compositional generalization is governed by complexity control via initialization scale and weight decay.
  • Whether a model memorizes or reasons is determined within a sharp, identifiable training window.
  • Weight decay applied during a single window covering 25% of training matches full-training weight decay in OOD accuracy (0.93 vs. 0.91).
  • Placing regularization in the middle of training yields 5 to 9 times higher OOD accuracy than early placement.
  • The study uses a controlled compositional task.
  • Existing analyses treat complexity control as a single static hyperparameter choice.
  • The research is from arXiv preprint 2605.04396.
  • The study identifies the boundary of this critical training window.
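
The idea of confining weight decay to one window of training, rather than fixing it as a static hyperparameter, can be illustrated with a minimal sketch. This is not the preprint's code: the function name, the window start (37.5% of training, chosen only so the window sits in the middle), and the coefficient 0.1 are all hypothetical, since the summary above reports only the 25% window length and the benefit of mid-training placement.

```python
def windowed_weight_decay(step, total_steps, start_frac=0.375,
                          window_frac=0.25, wd=0.1):
    """Return the weight-decay coefficient to use at a given step.

    The coefficient is `wd` inside the window and 0.0 outside it.
    start_frac and wd are illustrative assumptions, not values from the study.
    """
    start = int(start_frac * total_steps)
    end = int((start_frac + window_frac) * total_steps)
    return wd if start <= step < end else 0.0


def apply_decoupled_decay(weight, lr, coeff):
    """Shrink one scalar weight by decoupled weight decay for a single step."""
    return weight * (1.0 - lr * coeff)


total_steps = 1000
schedule = [windowed_weight_decay(s, total_steps) for s in range(total_steps)]

# Number of steps on which regularization is actually active.
active_steps = sum(1 for coeff in schedule if coeff > 0)

# Toy run: a single weight is only shrunk while the window is open.
weight = 1.0
for coeff in schedule:
    weight = apply_decoupled_decay(weight, lr=0.01, coeff=coeff)
```

In a PyTorch training loop, one way to use such a schedule would be to assign its value to `param_group["weight_decay"]` for each entry in `optimizer.param_groups` before every optimizer step. Moving `start_frac` toward 0.0 would give the early placement that the study compares against.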

Entities

Institutions

  • arXiv
