Data Scaling Laws Linked to Predictive Contribution Spectrum
A new study posted on arXiv (2605.20196) argues that data scaling laws on real-world corpora are governed by the dynamic coverage of a latent predictive contribution spectrum, not only by token-frequency tails. The researchers represent each text corpus as a suffix automaton and build a global-KL predictive contribution spectrum, in which each state’s contribution is its empirical mass multiplied by the KL divergence of its next-token distribution from a global next-token baseline. Across 12 real corpora, the tail slope of this spectrum was strongly linked to the data-scaling exponent of a small GPT learner. The study also defines an effective truncation rank K(N) for each training size N; log K is nearly linear in log N, with a pooled R² of 0.96 for the raw spectrum and 0.90 for the smoothed one.
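The "mass times KL" definition can be illustrated with a minimal Python sketch. This is not the study's implementation: it assumes fixed-length token contexts as a stand-in for suffix-automaton states, and the function name `contribution_spectrum` and the parameter `context_len` are hypothetical.

```python
import math
from collections import Counter, defaultdict


def contribution_spectrum(tokens, context_len=2):
    """Return per-context contributions (mass * KL), sorted in descending order.

    Assumption: fixed-length contexts approximate the suffix-automaton
    states used in the study; this is an illustration only.
    """
    following = tokens[context_len:]
    total = len(following)
    if total == 0:
        return []

    # Global next-token baseline over every position that has a full context.
    global_counts = Counter(following)

    # Next-token counts conditioned on each context ("state").
    next_counts = defaultdict(Counter)
    for i in range(context_len, len(tokens)):
        ctx = tuple(tokens[i - context_len:i])
        next_counts[ctx][tokens[i]] += 1

    spectrum = []
    for counts in next_counts.values():
        n_ctx = sum(counts.values())
        mass = n_ctx / total  # empirical mass of the state
        # KL(state's next-token distribution || global baseline), in nats.
        kl = sum(
            (c / n_ctx) * math.log((c / n_ctx) / (global_counts[t] / total))
            for t, c in counts.items()
        )
        spectrum.append(mass * kl)

    return sorted(spectrum, reverse=True)


if __name__ == "__main__":
    toy_corpus = "the cat sat on the mat and the cat ran".split()
    print(contribution_spectrum(toy_corpus, context_len=1))
```

Under this simplification the contributions sum to the empirical mutual information between context and next token, so the ranked list shows how that total is distributed across states.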
Key facts
- Study investigates hypothesis that data scaling laws are governed by predictive contribution spectrum.
- Uses suffix-automaton representation of text corpora.
- Defines global-KL predictive contribution spectrum.
- Tested across 12 real corpora.
- Tail slope correlates with data-scaling exponent of a GPT learner.
- Defines effective truncation rank K(N) for each training size N.
- log K is nearly linear in log N.
- Pooled R² is 0.96 for raw spectrum, 0.90 for smoothed spectrum.
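
The near-linear relation between log K and log N can be checked with an ordinary least-squares fit in log-log space. The sketch below shows one way to compute the slope and R²; the helper name `loglog_fit` is hypothetical, the example values are synthetic rather than taken from the study, and the sketch does not show how K(N) itself is derived.

```python
import math


def loglog_fit(ns, ks):
    """Fit log K = intercept + slope * log N by least squares.

    Returns (slope, intercept, r_squared).
    """
    xs = [math.log(n) for n in ns]
    ys = [math.log(k) for k in ks]
    m = len(xs)
    mean_x = sum(xs) / m
    mean_y = sum(ys) / m

    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x

    # R^2 = 1 - (residual sum of squares / total sum of squares).
    ss_res = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    r_squared = 1.0 - ss_res / ss_tot
    return slope, intercept, r_squared


if __name__ == "__main__":
    # Synthetic values for illustration only, not results from the study.
    example_ns = [1e3, 1e4, 1e5, 1e6]
    example_ks = [40.0, 110.0, 390.0, 1150.0]
    print(loglog_fit(example_ns, example_ks))
```

An R² near 1 in this fit means K(N) is well described by a power law in N, which is how the reported values of 0.96 and 0.90 should be read.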
Entities
Institutions
- arXiv