Transformers Need Bayesian Lottery Tickets for Grokking
A recent arXiv preprint (2605.15787) argues that grokking, the delayed generalization observed in Transformers, arises from a structural inference problem. The authors formalize attention as an implicit Bayesian posterior over task dependency graphs and show that generalization depends on two conditions: a Goldilocks bound on MLP capacity, consistent with norm-based theories, and a new Bayesian structural condition requiring attention to place sufficient weight on every informative token. Separating the two conditions explains delayed generalization as delayed structural inference: early in training, the MLP memorizes using unaligned features while attention misallocates probability mass. The study also highlights a constraint specific to attention-based models: if attention discards an informative token, no bounded downstream computation can recover it.
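The attention constraint can be illustrated with a toy calculation. The sketch below is not the paper's construction. It assumes a single softmax attention query over a handful of random value vectors, and it reads "bounded downstream computation" as an L-Lipschitz function, which is an interpretation rather than something the summary states. The names `attend`, `output_gap`, and the index `informative` are hypothetical. Two inputs differ only in the informative token's value vector, so the attention outputs differ by exactly the attention weight on that token times the size of the change. A downstream L-Lipschitz map can then separate the two cases by at most L times that gap.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over a 1-D score vector.
    e = np.exp(x - np.max(x))
    return e / e.sum()

def attend(scores, values):
    # Single-query attention: a convex combination of the value vectors.
    weights = softmax(scores)
    return weights, weights @ values

rng = np.random.default_rng(0)
n_tokens, dim = 5, 4
informative = 2  # hypothetical index of the token that carries the signal

# Two inputs that differ only in the informative token's value vector.
values_a = rng.normal(size=(n_tokens, dim))
values_b = values_a.copy()
delta = np.ones(dim)
values_b[informative] += delta

def output_gap(scores):
    # Distance between the attention outputs for the two inputs.
    w, out_a = attend(scores, values_a)
    _, out_b = attend(scores, values_b)
    return w[informative], np.linalg.norm(out_a - out_b)

# Case 1: attention effectively discards the informative token.
scores_ignore = np.zeros(n_tokens)
scores_ignore[informative] = -20.0
w_ignore, gap_ignore = output_gap(scores_ignore)

# Case 2: attention places substantial weight on the informative token.
scores_focus = np.zeros(n_tokens)
scores_focus[informative] = 2.0
w_focus, gap_focus = output_gap(scores_focus)

# In both cases the gap equals weight * ||delta||, so an L-Lipschitz
# downstream map can separate the two inputs by at most L * gap.
print("ignored:", w_ignore, gap_ignore)
print("focused:", w_focus, gap_focus)
```

In the first case the gap is negligible, so the downstream layers would need an enormous Lipschitz constant to tell the two inputs apart. In the second case the signal survives attention and a modest downstream map suffices.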
Key facts
- Paper ID: arXiv:2605.15787
- Announce Type: cross
- Grokking is defined as delayed generalization in Transformers after memorization
- Existing explanations include norm minimization, feature emergence, and sparse subnetworks
- New constraint: an informative token that attention discards cannot be recovered by any bounded downstream computation
- Attention is formalized as an implicit Bayesian posterior over task dependency graphs
- Two conditions for generalization: a Goldilocks bound on MLP capacity and a Bayesian structural condition on attention
- Delayed generalization is attributed to delayed structural inference
Entities
Institutions
- arXiv