Transformers Need Bayesian Lottery Tickets for Grokking

ai-technology · 2026-05-18

A recent study posted to arXiv (2605.15787) argues that grokking, the delayed generalization observed in Transformers after memorization, arises from a structural inference problem. The authors formalize attention as an implicit Bayesian posterior over task dependency graphs and show that generalization depends on two conditions: a Goldilocks bound on MLP capacity, consistent with norm-based theories, and a new Bayesian structural condition under which attention must place sufficient weight on every informative token. Separating the two conditions explains delayed generalization as delayed structural inference: early in training, the MLP memorizes using unaligned features while attention misallocates probability mass. The study also highlights a constraint specific to attention-based models: if attention discards an informative token, that token's information cannot be recovered by any bounded downstream computation.
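The irrecoverability constraint can be illustrated with a toy single-query attention over scalar values. This is a minimal sketch, not the paper's construction: the scores, values, and helper names (`softmax`, `attend`, `output_gap`) are hypothetical, and "bounded downstream computation" is read here, as an assumption, to mean a map with a bounded Lipschitz constant.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(scores, values):
    """Single-query attention over scalar values: a weighted average."""
    weights = softmax(scores)
    return sum(w * v for w, v in zip(weights, values)), weights

def output_gap(scores, informative_idx, values_a, values_b):
    """Change in attention output when only the informative token's value changes."""
    out_a, weights = attend(scores, values_a)
    out_b, _ = attend(scores, values_b)
    return abs(out_a - out_b), weights[informative_idx]

# Hypothetical toy inputs: token 2 carries the task-relevant value (+1 vs -1),
# the other tokens are distractors that are identical in both inputs.
values_a = [0.3, -0.1, 1.0, 0.2]
values_b = [0.3, -0.1, -1.0, 0.2]

# Assumed attention scores: one pattern neglects token 2, one focuses on it.
neglect_scores = [2.0, 2.0, -10.0, 2.0]
focus_scores = [0.0, 0.0, 4.0, 0.0]

gap_neglect, w_neglect = output_gap(neglect_scores, 2, values_a, values_b)
gap_focus, w_focus = output_gap(focus_scores, 2, values_a, values_b)

print(f"neglect: weight={w_neglect:.2e}, output gap={gap_neglect:.2e}")
print(f"focus:   weight={w_focus:.3f}, output gap={gap_focus:.3f}")
```

In this sketch the output gap equals the attention weight on the informative token times the difference between its two values. A downstream map with Lipschitz constant K can separate the two attention outputs by at most K times that gap, so when the weight is close to zero the two inputs stay nearly indistinguishable unless K is very large.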

Key facts

Paper ID: arXiv:2605.15787
Announce Type: cross
Grokking is defined as delayed generalization in Transformers after memorization
Existing explanations include norm minimization, feature emergence, and sparse subnetworks
New constraint: attention discarding informative tokens cannot be recovered downstream
Attention is formalized as an implicit Bayesian posterior over task dependency graphs
Two conditions for generalization: Goldilocks bound on MLP capacity and Bayesian structural condition
Delayed generalization is attributed to delayed structural inference
