Data Scaling Laws Linked to Predictive Contribution Spectrum

ai-technology · 2026-05-22

A new study posted on arXiv (2605.20196) suggests that data scaling laws on real-world corpora are shaped by the dynamic coverage of a latent predictive contribution spectrum, not just by token-frequency tails. The researchers use a suffix-automaton representation of each corpus to build a global-KL predictive contribution spectrum, in which each state’s contribution is its empirical mass times its KL divergence from a global next-token baseline. Across 12 real corpora, the tail slope of this spectrum is strongly linked to the data-scaling exponent of a small GPT learner. The study also defines an effective truncation rank K(N) for each training size N and finds that log K is nearly linear in log N, with a pooled R² of 0.96 for the raw spectrum and 0.90 for the smoothed one.
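The "empirical mass times KL divergence" definition can be illustrated with a minimal Python sketch. It is not the study's implementation: it assumes fixed-length contexts (the previous `order` tokens) as a stand-in for suffix-automaton states, and a unigram distribution over predicted tokens as the global baseline. The function name `contribution_spectrum` and the `order` parameter are hypothetical.

```python
import math
from collections import Counter, defaultdict


def contribution_spectrum(tokens, order=2):
    """Return per-context contributions, sorted from largest to smallest.

    ASSUMPTION: a context is the previous `order` tokens. The study uses
    suffix-automaton states instead; this is only a simplified stand-in.
    """
    targets = tokens[order:]
    total = len(targets)
    if total == 0:
        return []

    # Global next-token baseline: unigram distribution over predicted tokens.
    global_counts = Counter(targets)

    # Next-token counts for each context.
    ctx_counts = defaultdict(Counter)
    for i in range(order, len(tokens)):
        ctx = tuple(tokens[i - order:i])
        ctx_counts[ctx][tokens[i]] += 1

    contributions = []
    for counts in ctx_counts.values():
        n_ctx = sum(counts.values())
        mass = n_ctx / total  # empirical mass of this context
        # KL(P(. | context) || P_global), in nats.
        kl = sum(
            (c / n_ctx) * math.log((c / n_ctx) / (global_counts[t] / total))
            for t, c in counts.items()
        )
        contributions.append(mass * kl)

    return sorted(contributions, reverse=True)
```

Sorting the contributions by rank gives a spectrum whose tail can then be inspected on a log-log plot. Under these assumptions the contributions sum to the mutual information between context and next token, so a perfectly repetitive corpus yields a spectrum of zeros.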

Key facts

Study investigates hypothesis that data scaling laws are governed by predictive contribution spectrum.
Uses suffix-automaton representation of text corpora.
Defines global-KL predictive contribution spectrum.
Tested across 12 real corpora.
Tail slope correlates with data-scaling exponent of a GPT learner.
Defines effective truncation rank K(N) for each training size N.
log K is nearly linear in log N.
Pooled R² is 0.96 for raw spectrum, 0.90 for smoothed spectrum.
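
The "nearly linear in log N" claim amounts to an ordinary least-squares fit in log-log space, scored by R². The sketch below shows that fit in plain Python. The summary does not say how K(N) is computed, so the (N, K) pairs here are synthetic values generated from a made-up power law, not data from the study, and `loglog_fit` is a hypothetical helper name.

```python
import math


def loglog_fit(xs, ys):
    """Fit log(y) = slope * log(x) + intercept; return (slope, intercept, r2)."""
    lx = [math.log(x) for x in xs]
    ly = [math.log(y) for y in ys]
    n = len(lx)
    mean_x = sum(lx) / n
    mean_y = sum(ly) / n

    sxx = sum((a - mean_x) ** 2 for a in lx)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(lx, ly))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x

    # R^2 = 1 - (residual sum of squares) / (total sum of squares).
    ss_res = sum((b - (slope * a + intercept)) ** 2 for a, b in zip(lx, ly))
    ss_tot = sum((b - mean_y) ** 2 for b in ly)
    r2 = 1.0 - ss_res / ss_tot
    return slope, intercept, r2


# ASSUMPTION: synthetic, illustrative values only (K = 3 * N**0.5).
train_sizes = [1e4, 1e5, 1e6, 1e7]
ranks = [3.0 * n ** 0.5 for n in train_sizes]
slope, intercept, r2 = loglog_fit(train_sizes, ranks)
```

The same helper could be applied to (rank, contribution) pairs to estimate a spectrum's tail slope, although the study's exact fitting procedure may differ.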
