SCALE: Minimalist Optimizer Reduces LLM Pretraining Memory

ai-technology · 2026-05-23

Researchers propose SCALE (Stochastic Column-normAlized Last-layer momEntum), a memory-efficient optimizer for pretraining large language models. It combines two ingredients: column-wise gradient normalization, and first-order momentum kept only for the output layer. According to the authors, this matches Adam's pretraining performance with minimal memory overhead.
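The update described above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: each weight matrix is assumed to be stored with columns indexing the output dimension, the momentum is assumed to be an exponential moving average of the gradient, and normalization is assumed to be applied after the momentum step. The names `column_normalize`, `scale_step`, `lm_head`, `lr`, `beta`, and `eps` are hypothetical.

```python
import math


def column_normalize(grad, eps=1e-8):
    """Divide each column of a matrix (list of rows) by its L2 norm."""
    rows, cols = len(grad), len(grad[0])
    norms = [
        math.sqrt(sum(grad[i][j] ** 2 for i in range(rows))) + eps
        for j in range(cols)
    ]
    return [[grad[i][j] / norms[j] for j in range(cols)] for i in range(rows)]


def scale_step(weights, grads, momentum, last_layer, lr=0.01, beta=0.9):
    """One SCALE-style step over a dict of named weight matrices.

    Assumption: `momentum` holds a buffer for `last_layer` only, so every
    other layer is updated without any optimizer state.
    """
    for name, w in weights.items():
        g = grads[name]
        if name == last_layer:
            m = momentum.get(name)
            if m is None:
                m = [[0.0] * len(g[0]) for _ in g]
            # Assumed form: exponential moving average of the gradient.
            m = [
                [beta * m_ij + (1 - beta) * g_ij for m_ij, g_ij in zip(m_row, g_row)]
                for m_row, g_row in zip(m, g)
            ]
            momentum[name] = m
            g = m
        # Stateless column-wise normalization of the update direction.
        d = column_normalize(g)
        weights[name] = [
            [w_ij - lr * d_ij for w_ij, d_ij in zip(w_row, d_row)]
            for w_row, d_row in zip(w, d)
        ]
    return weights, momentum
```

Calling `scale_step(weights, grads, {}, "lm_head")` with dictionaries that map layer names to matrices returns the updated weights and a state dictionary holding a single entry for the output layer. All other layers are updated like normalized SGD.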

Key facts

SCALE combines column-wise gradient normalization and output-layer-only momentum.
It matches the state-of-the-art pretraining performance of Adam.
It reduces memory usage compared with Adam and with memory-efficient variants.
Column-wise normalization rescales gradients along the output dimension.
First-order momentum is applied only to the output layer, where gradient variance is highest.
The approach is a minimal modification to plain SGD.
SCALE stands for Stochastic Column-normAlized Last-layer momEntum.
The paper is available on arXiv under ID 2506.16659.
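The memory argument in the facts above can be made concrete by counting optimizer state. The sketch below relies on two well-known properties (Adam keeps two moment buffers per parameter, plain SGD keeps none) and on the assumption that a SCALE-style optimizer stores one momentum buffer for the output layer only. The layer names and shapes are a hypothetical toy model, not figures from the paper.

```python
def num_elements(shape):
    """Number of entries in a 2-D weight matrix of the given shape."""
    rows, cols = shape
    return rows * cols


def optimizer_state_elements(shapes, optimizer, last_layer="lm_head"):
    """Count the extra values an optimizer stores beyond the weights."""
    total = sum(num_elements(s) for s in shapes.values())
    if optimizer == "sgd":
        return 0  # plain SGD is stateless
    if optimizer == "adam":
        return 2 * total  # first and second moment for every parameter
    if optimizer == "scale":
        # Assumption: a single momentum buffer, output layer only.
        return num_elements(shapes[last_layer])
    raise ValueError(f"unknown optimizer: {optimizer}")


# Hypothetical toy model; real LLMs have many more hidden layers,
# so the output layer is a much smaller share of the total.
toy_shapes = {
    "embed": (32000, 512),
    "block": (512, 2048),
    "lm_head": (512, 32000),
}

for opt in ("sgd", "adam", "scale"):
    print(opt, optimizer_state_elements(toy_shapes, opt))
```

Because the hidden layers dominate the parameter count in realistic models, the one output-layer buffer is a small addition on top of stateless SGD.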
