Gradient Transport Framework for LLM Pretraining
This work develops a finite-size gradient-transport framework for training language models, built on five observables (D, z, β, δ, v_rel) that separate cascade size, cascade duration, absolute transport, and intensive transport efficiency. The framework is applied to raw-gradient data from Pico-LM across four scales and 125 aligned steps, and to a companion Pythia dataset of five scales built from 153 aligned checkpoint-difference update fields. The two model families share a nearly identical cascade-size backbone but occupy different transport regimes: Pico-LM shows positive scaling in duration and negative scaling in intensive efficiency, while Pythia stays close to the D=1 baseline with only weak positive dependence on efficiency scale. Randomized-field controls yield nearly matched null floors in both the intensive and duration channels, indicating that the cross-family contrast reflects real departures rather than measurement artifacts.
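Scaling statements of the kind summarized above (positive duration scaling, negative intensive-efficiency scaling) are commonly quantified as power-law exponents fitted in log-log space across model scales. A minimal sketch of that estimation step, where the function name, the scale values, and the observable are all illustrative stand-ins rather than quantities taken from the Pico-LM or Pythia pipelines:

```python
import numpy as np

def fit_scaling_exponent(scales, observable):
    """Estimate the exponent b in observable ~ scale**b via a
    least-squares line fit in log-log space; the slope is b."""
    logx = np.log(np.asarray(scales, dtype=float))
    logy = np.log(np.asarray(observable, dtype=float))
    b, _log_a = np.polyfit(logx, logy, 1)  # degree-1 fit: slope, intercept
    return b

# Hypothetical: four model scales (parameter counts) and a
# duration-like observable that follows an exact 0.3 power law.
scales = [1e7, 5e7, 2.5e8, 1.25e9]
duration = [s ** 0.3 for s in scales]
print(round(fit_scaling_exponent(scales, duration), 3))  # → 0.3
```

A positive fitted exponent corresponds to the "positive duration scaling" reported for Pico-LM; a negative one to its intensive-efficiency channel.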
Key facts
- Framework uses five observables: D, z, β, δ, v_rel
- Pico-LM analyzed across four scales and 125 aligned steps
- Pythia dataset built from 153 aligned checkpoint-difference update fields
- Both families share near-unity cascade-size backbone
- Pico-LM shows positive duration scaling and negative intensive-efficiency scaling
- Pythia remains near D=1 baseline with weak positive efficiency scale dependence
- Randomized-field controls give nearly matched null floors
- Cross-family contrast reflects real transport-regime departures, not null-floor artifacts
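The randomized-field controls mentioned above can be understood as shuffling a field to destroy its structure while preserving its value distribution, then re-measuring a structure-sensitive observable; the real measurement should exceed the shuffled null floor. A toy sketch of that logic, where the AR(1) trace, the threshold, and the run-length observable are hypothetical stand-ins for the framework's actual observables:

```python
import numpy as np

def mean_avalanche_duration(field, threshold=1.0):
    """Mean length of consecutive runs with |field| > threshold:
    a duration-style observable sensitive to temporal structure."""
    above = np.abs(np.asarray(field)) > threshold
    edges = np.diff(above.astype(int))
    starts = np.flatnonzero(edges == 1) + 1   # indices where a run begins
    ends = np.flatnonzero(edges == -1) + 1    # indices just past a run's end
    if above[0]:
        starts = np.r_[0, starts]
    if above[-1]:
        ends = np.r_[ends, above.size]
    return float(np.mean(ends - starts)) if starts.size else 0.0

rng = np.random.default_rng(0)
# Hypothetical "gradient trace" with correlated bursts: AR(1) noise.
n = 20_000
noise = rng.standard_normal(n)
field = np.empty(n)
field[0] = noise[0]
for t in range(1, n):
    field[t] = 0.9 * field[t - 1] + noise[t]

real = mean_avalanche_duration(field)
# Control: shuffling preserves the marginal distribution but destroys
# the temporal correlations that produce long runs.
null = mean_avalanche_duration(rng.permutation(field))
print(real > null)  # correlated bursts outlast the shuffled null floor
```

Because the shuffled control keeps the same values, any gap between `real` and `null` is attributable to structure, which is the sense in which matched null floors let the summary call the Pico-LM/Pythia contrast a real departure.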