SCALE: Minimalist Optimizer Reduces LLM Pretraining Memory
Researchers propose SCALE (Stochastic Column-normAlized Last-layer momEntum), a memory-efficient optimizer for pretraining large language models. It combines column-wise gradient normalization with first-order momentum applied only to the output layer, matching Adam's performance with minimal memory overhead.
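To make the two ingredients concrete, here is a minimal NumPy sketch of one update step. It is an illustration under stated assumptions, not the authors' implementation: weights are assumed to be stored with shape (d_in, d_out) so that each column belongs to one output unit, momentum is assumed to be an exponential moving average of the gradient, and normalization is assumed to be applied after momentum. The names column_normalize and scale_like_step, and the values of lr, beta, and eps, are hypothetical.

```python
import numpy as np

def column_normalize(grad, eps=1e-8):
    """Divide each column of a 2-D gradient by its L2 norm (assumed layout: d_in x d_out)."""
    norms = np.linalg.norm(grad, axis=0, keepdims=True)
    return grad / (norms + eps)

def scale_like_step(params, grads, momentum_buf, last_layer, lr=1e-3, beta=0.9):
    """Update params in place; momentum state is kept only for `last_layer`."""
    for name, g in grads.items():
        if name == last_layer:
            # Assumed momentum form: exponential moving average of the gradient.
            momentum_buf = beta * momentum_buf + (1.0 - beta) * g
            g = momentum_buf
        # Every layer takes a plain SGD step on the column-normalized direction.
        params[name] -= lr * column_normalize(g)
    return momentum_buf

# Toy usage with hypothetical layer names and shapes.
rng = np.random.default_rng(0)
params = {"hidden": rng.normal(size=(4, 3)), "output": rng.normal(size=(3, 5))}
grads = {name: rng.normal(size=w.shape) for name, w in params.items()}
buf = np.zeros_like(params["output"])
buf = scale_like_step(params, grads, buf, last_layer="output")
```

Only the output layer carries an extra buffer between steps; all other layers are updated statelessly, which is where the memory saving in this sketch comes from.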
Key facts
- SCALE combines column-wise gradient normalization and output-layer-only momentum.
- It matches Adam's state-of-the-art pretraining performance.
- It uses less memory than Adam and than existing memory-efficient optimizers.
- Column-wise normalization rescales each gradient matrix along its output dimension.
- First-order momentum is applied only where gradient variance is highest (output layer).
- The approach is a minimal modification to plain SGD.
- SCALE stands for Stochastic Column-normAlized Last-layer momEntum.
- The paper is available on arXiv under ID 2506.16659.
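The memory claim can be illustrated with a rough count of optimizer-state elements. The sketch below assumes Adam keeps two moment buffers per parameter (its standard formulation) and that a SCALE-style optimizer keeps a single momentum buffer for the output layer only. The function name and the layer shapes are hypothetical and are not taken from the paper; weights, gradients, and activations are not counted.

```python
def optimizer_state_elements(shapes, last_layer):
    """Return (adam, scale_like) counts of extra optimizer-state elements."""
    sizes = {name: rows * cols for name, (rows, cols) in shapes.items()}
    adam = 2 * sum(sizes.values())  # first and second moment for every parameter
    scale_like = sizes[last_layer]  # one momentum buffer, output layer only
    return adam, scale_like

# Hypothetical toy shapes, chosen only for illustration.
shapes = {"embed": (1000, 64), "mlp": (64, 256), "output": (64, 1000)}
adam_count, scale_count = optimizer_state_elements(shapes, last_layer="output")
print(adam_count, scale_count)
```

In a model with many hidden layers the output layer is a small share of all parameters, so the gap between the two counts widens as depth grows.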
Entities
Institutions
- arXiv