Gradient Transport Framework for LLM Pretraining
This work develops a finite-size gradient-transport framework for training language models, built on five observables (D, z, β, δ, v_rel) that separate cascade size, cascade duration, absolute transport, and intensive transport efficiency. The framework is applied to raw-gradient data from Pico-LM across four scales and 125 aligned steps, and to a companion Pythia dataset of five scales built from 153 aligned checkpoint-difference update fields. The two model families share a nearly identical cascade-size backbone but occupy different transport regimes: Pico-LM shows positive scaling in duration and negative scaling in intensive efficiency, while Pythia stays close to the D=1 baseline with only weak positive dependence on efficiency scale. Randomized-field controls yield nearly matched null floors in both the intensive and duration channels, indicating that the cross-family contrast reflects real departures rather than measurement artifacts.
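Scaling statements of the kind summarized above (positive duration scaling, negative intensive-efficiency scaling) are commonly quantified as power-law exponents fitted in log-log space across model scales. A minimal sketch of that estimation step, where the function name, the scale values, and the observable are all illustrative stand-ins rather than quantities taken from the Pico-LM or Pythia pipelines:

```python
import numpy as np

def fit_scaling_exponent(scales, observable):
    """Estimate the exponent b in observable ~ scale**b via a
    least-squares line fit in log-log space; the slope is b."""
    logx = np.log(np.asarray(scales, dtype=float))
    logy = np.log(np.asarray(observable, dtype=float))
    b, _log_a = np.polyfit(logx, logy, 1)  # degree-1 fit: slope, intercept
    return b

# Hypothetical: four model scales (parameter counts) and a
# duration-like observable that follows an exact 0.3 power law.
scales = [1e7, 5e7, 2.5e8, 1.25e9]
duration = [s ** 0.3 for s in scales]
print(round(fit_scaling_exponent(scales, duration), 3))  # → 0.3
```

A positive fitted exponent corresponds to the "positive duration scaling" reported for Pico-LM; a negative one to its intensive-efficiency channel.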
Key facts
- Framework uses five observables: D, z, β, δ, v_rel
- Pico-LM analyzed across four scales and 125 aligned steps
- Pythia dataset built from 153 aligned checkpoint-difference update fields
- Both families share near-unity cascade-size backbone
- Pico-LM shows positive duration scaling and negative intensive-efficiency scaling
- Pythia remains near D=1 baseline with weak positive efficiency scale dependence
- Randomized-field controls give nearly matched null floors
- Cross-family contrast reflects real transport-regime departures, not null-floor artifacts
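The randomized-field controls mentioned above can be understood as shuffling a field to destroy its structure while preserving its value distribution, then re-measuring a structure-sensitive observable; the real measurement should exceed the shuffled null floor. A toy sketch of that logic, where the AR(1) trace, the threshold, and the run-length observable are hypothetical stand-ins for the framework's actual observables:

```python
import numpy as np

def mean_avalanche_duration(field, threshold=1.0):
    """Mean length of consecutive runs with |field| > threshold:
    a duration-style observable sensitive to temporal structure."""
    above = np.abs(np.asarray(field)) > threshold
    edges = np.diff(above.astype(int))
    starts = np.flatnonzero(edges == 1) + 1   # indices where a run begins
    ends = np.flatnonzero(edges == -1) + 1    # indices just past a run's end
    if above[0]:
        starts = np.r_[0, starts]
    if above[-1]:
        ends = np.r_[ends, above.size]
    return float(np.mean(ends - starts)) if starts.size else 0.0

rng = np.random.default_rng(0)
# Hypothetical "gradient trace" with correlated bursts: AR(1) noise.
n = 20_000
noise = rng.standard_normal(n)
field = np.empty(n)
field[0] = noise[0]
for t in range(1, n):
    field[t] = 0.9 * field[t - 1] + noise[t]

real = mean_avalanche_duration(field)
# Control: shuffling preserves the marginal distribution but destroys
# the temporal correlations that produce long runs.
null = mean_avalanche_duration(rng.permutation(field))
print(real > null)  # correlated bursts outlast the shuffled null floor
```

Because the shuffled control keeps the same values, any gap between `real` and `null` is attributable to structure, which is the sense in which matched null floors let the summary call the Pico-LM/Pythia contrast a real departure.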