First Quantitative Law Predicts Grokking Delay Under AdamW

other · 2026-05-20

Researchers have introduced a closed-form law that predicts grokking delay, the gap between memorization and generalization, under AdamW by treating the delay as a first-passage time. Calibrated on a single hyperparameter cell, the law predicts 26 held-out runs spanning a 41-fold range with a mean absolute percentage error (MAPE) of 17.7%. On multilayer perceptrons (MLPs) it reaches a MAPE of 18.0% across 34 cases, and a cross-task extension reaches a MAPE of 23.3% across 46 cases spanning a 43.5x range. An accompanying quantile-margin theorem states that a positive delay requires norm separation, that is, V_mem must exceed V_star.
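To make the reported error figures concrete, here is a minimal sketch of how a MAPE score is computed. The delay values are hypothetical and chosen only for illustration; they are not taken from the paper.

```python
def mape(predicted, observed):
    """Mean absolute percentage error, returned in percent."""
    if len(predicted) != len(observed) or not observed:
        raise ValueError("inputs must be non-empty and of equal length")
    # Average the relative error of each prediction against its observation.
    total = sum(abs(p - o) / abs(o) for p, o in zip(predicted, observed))
    return 100.0 * total / len(observed)


# Hypothetical grokking delays in optimizer steps (illustrative only).
observed = [1000.0, 5000.0, 20000.0, 41000.0]
predicted = [1100.0, 4500.0, 24000.0, 41000.0]

score = mape(predicted, observed)
print(f"MAPE: {score:.1f}%")
```

Because each error is divided by its own observed value, short and long delays contribute equally, which is why the metric suits results that span a wide range.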

Key facts

First quantitative prediction of grokking delay under AdamW
Closed-form law: T_grok - T_mem = (1 / (2 kappa_LL eta lambda)) log(V_mem / V_star)
Calibrated on a single hyperparameter cell; predicts 26 held-out runs with MAPE 17.7% (41x range)
Generalizes to MLPs (MAPE 18.0%, N=34)
Cross-task extension MAPE 23.3% (N=46, 43.5x range)
Quantile-margin theorem: positive delay requires norm separation V_mem > V_star
First-passage of V_t is necessary but not sufficient
V_star / V_mem is stable within an architecture (coefficient of variation ~14% on a one-layer transformer)
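The closed-form law listed above can be sketched in a few lines. This is an illustration under stated assumptions, not the authors' code: it reads the prefactor as 1 / (2 kappa_LL eta lambda), takes eta as the AdamW learning rate and lambda as the weight decay, and uses hypothetical function names and numbers throughout.

```python
import math


def predicted_delay(v_mem, v_star, kappa_ll, eta, lam):
    """T_grok - T_mem = log(V_mem / V_star) / (2 * kappa_LL * eta * lambda).

    Assumes eta is the learning rate and lam the weight decay.
    """
    if min(v_mem, v_star, kappa_ll, eta, lam) <= 0:
        raise ValueError("all inputs must be positive")
    return math.log(v_mem / v_star) / (2.0 * kappa_ll * eta * lam)


def calibrate_kappa(v_mem, v_star, eta, lam, observed_delay):
    """Solve the same law for kappa_LL from one observed delay."""
    return math.log(v_mem / v_star) / (2.0 * eta * lam * observed_delay)


# Hypothetical calibration cell: one (eta, lambda) setting with a measured delay.
kappa = calibrate_kappa(v_mem=4.0, v_star=1.0, eta=1e-3, lam=1.0,
                        observed_delay=2000.0)

# Predict a different hypothetical cell with half the weight decay.
delay = predicted_delay(v_mem=4.0, v_star=1.0, kappa_ll=kappa,
                        eta=1e-3, lam=0.5)
print(f"kappa_LL = {kappa:.4f}, predicted delay = {delay:.0f} steps")

# Without norm separation (V_mem <= V_star) the log term is not positive,
# so the law gives no positive delay.
no_separation = predicted_delay(v_mem=1.0, v_star=2.0, kappa_ll=kappa,
                                eta=1e-3, lam=1.0)
```

Under this reading the delay scales inversely with the product of eta and lambda, so halving the weight decay doubles the predicted delay, and the sign of the log term mirrors the norm-separation condition V_mem > V_star.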

Sources

arXiv cs.AI — 2026-05-20