First Quantitative Law Predicts Grokking Delay Under AdamW
Researchers have introduced a closed-form law that predicts grokking delay, the gap between memorization and generalization, under AdamW by treating the delay as a first-passage time. Calibrated on a single hyperparameter cell, the law predicts 26 held-out runs spanning a 41-fold range with a mean absolute percentage error (MAPE) of 17.7%. It carries over to multilayer perceptrons (MLPs) with a MAPE of 18.0% across 34 cases, and a cross-task extension reaches a MAPE of 23.3% across 46 cases spanning a 43.5x range. An accompanying quantile-margin theorem states that a positive delay requires norm separation, that is, V_mem must exceed V_star.
Key facts
- First quantitative prediction of grokking delay under AdamW
- Closed-form law: T_grok - T_mem = log(V_mem / V_star) / (2 kappa_LL eta lambda)
- Calibrated on a single hyperparameter cell, predicts 26 held-out runs with MAPE 17.7%
- Generalizes to MLPs (MAPE 18.0%, N=34)
- Cross-task extension MAPE 23.3% (N=46, 43.5x range)
- Quantile-margin theorem: positive delay requires norm separation V_mem > V_star
- First-passage of V_t is necessary but not sufficient
- V_star / V_mem stable within architecture (CV ~14% on 1L transformer)
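To make the closed-form law concrete, here is a minimal Python sketch that evaluates it and scores predictions with MAPE. It assumes the prefactor reads as 1 / (2 * kappa_LL * eta * lambda), with eta the learning rate and lambda the weight decay. The function names, the choice to return zero when there is no norm separation, and all numeric values are illustrative assumptions, not taken from the paper.

```python
import math


def predicted_delay(kappa_ll, eta, lam, v_mem, v_star):
    """Predicted T_grok - T_mem in optimizer steps.

    Assumed reading of the law:
        log(V_mem / V_star) / (2 * kappa_LL * eta * lambda)
    """
    if v_mem <= v_star:
        # No norm separation: the law gives no positive delay.
        # Returning 0.0 here is an illustrative convention, not the paper's.
        return 0.0
    return math.log(v_mem / v_star) / (2.0 * kappa_ll * eta * lam)


def mape(predicted, observed):
    """Mean absolute percentage error, in percent."""
    errors = [abs(p - o) / abs(o) for p, o in zip(predicted, observed)]
    return 100.0 * sum(errors) / len(errors)


# Hypothetical values, chosen only to show the scaling behaviour.
base = predicted_delay(kappa_ll=1.0, eta=1e-3, lam=1.0, v_mem=100.0, v_star=10.0)
doubled_wd = predicted_delay(kappa_ll=1.0, eta=1e-3, lam=2.0, v_mem=100.0, v_star=10.0)
print(f"delay at lambda=1.0: {base:.0f} steps")
print(f"delay at lambda=2.0: {doubled_wd:.0f} steps")
```

Under this reading, the delay scales inversely with both learning rate and weight decay and only logarithmically with the norm ratio, so doubling the weight decay halves the predicted delay while doubling V_mem / V_star adds a fixed increment.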