ARTFEED — Contemporary Art Intelligence

SCALE: Minimalist Optimizer Reduces LLM Pretraining Memory

ai-technology · 2026-05-23

Researchers propose SCALE (Stochastic Column-normAlized Last-layer momEntum), a memory-efficient optimizer for pretraining large language models. It combines column-wise gradient normalization on the weight matrices with first-order momentum that is kept only for the output layer. The authors report that it matches Adam's pretraining performance while adding very little optimizer memory on top of plain SGD.
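To make the memory claim concrete, here is a rough accounting sketch. It counts only persistent optimizer state (not weights or gradients), using the standard fact that Adam keeps two moment buffers per parameter, and it assumes SCALE keeps a single momentum buffer for the output layer and recomputes column norms on the fly. The function name `optimizer_state_entries` and the layer sizes are hypothetical and chosen for illustration only; they do not come from the paper.

```python
def optimizer_state_entries(layer_sizes, optimizer):
    """Count persistent optimizer-state entries for a list of layer sizes.

    layer_sizes: parameter counts per layer, with the output layer last.
    optimizer:   "sgd", "adam", or "scale" (the last is an assumed model
                 of SCALE: one momentum buffer on the output layer only).
    """
    total = sum(layer_sizes)
    if optimizer == "sgd":
        return 0                   # plain SGD keeps no state
    if optimizer == "adam":
        return 2 * total           # first and second moment per parameter
    if optimizer == "scale":
        return layer_sizes[-1]     # momentum for the output layer only
    raise ValueError(f"unknown optimizer: {optimizer}")


# Hypothetical toy model: two hidden blocks and one output layer.
sizes = [1_000_000, 1_000_000, 250_000]
for name in ("sgd", "adam", "scale"):
    print(name, optimizer_state_entries(sizes, name))
```

Under these assumptions the state scales with the output layer alone rather than with the whole model, which is the sense in which the overhead relative to SGD is small.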

Key facts

  • SCALE combines column-wise gradient normalization and output-layer-only momentum.
  • It is reported to match the pretraining performance of Adam, the state-of-the-art baseline.
  • It uses less memory than Adam and than existing memory-efficient optimizers.
  • Column-wise normalization rescales each gradient matrix along its output dimension.
  • First-order momentum is applied only to the output layer, where gradient variance is highest.
  • The approach is a minimal modification to plain SGD.
  • SCALE stands for Stochastic Column-normAlized Last-layer momEntum.
  • The paper is available on arXiv under ID 2506.16659.
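The two ingredients listed above can be sketched in a few lines of plain Python. This is an illustrative reading of the summary, not the authors' implementation. It assumes weight matrices are stored as lists of rows with shape (inputs, outputs), so each column corresponds to one output unit; that "normalization" means dividing each column by its L2 norm; and that on the output layer the momentum average is formed first and then normalized. The names `column_normalize`, `scale_step`, `beta`, and `eps` are hypothetical.

```python
import math


def column_normalize(grad, eps=1e-8):
    """Divide each column of a 2-D gradient (list of rows) by its L2 norm."""
    n_rows, n_cols = len(grad), len(grad[0])
    norms = [
        math.sqrt(sum(grad[i][j] ** 2 for i in range(n_rows))) + eps
        for j in range(n_cols)
    ]
    return [[grad[i][j] / norms[j] for j in range(n_cols)] for i in range(n_rows)]


def scale_step(weight, grad, lr, momentum_buf=None, beta=0.9):
    """One SCALE-style update for a single weight matrix (assumed form).

    Pass momentum_buf=None for ordinary layers, which keep no state.
    Pass a matrix for the output layer, where an exponential moving
    average of gradients is maintained and then column-normalized.
    """
    if momentum_buf is not None:
        momentum_buf = [
            [beta * m + (1 - beta) * g for m, g in zip(m_row, g_row)]
            for m_row, g_row in zip(momentum_buf, grad)
        ]
        direction = column_normalize(momentum_buf)
    else:
        direction = column_normalize(grad)
    new_weight = [
        [w - lr * d for w, d in zip(w_row, d_row)]
        for w_row, d_row in zip(weight, direction)
    ]
    return new_weight, momentum_buf


# Toy usage: a hidden layer (no state) and an output layer (with momentum).
w = [[0.0, 0.0], [0.0, 0.0]]
g = [[3.0, 0.0], [4.0, 2.0]]
hidden_w, _ = scale_step(w, g, lr=0.1)
output_w, buf = scale_step(w, g, lr=0.1, momentum_buf=[[0.0, 0.0], [0.0, 0.0]])
```

Because every column of the update direction has unit norm, the step size per output unit is governed by the learning rate rather than by the raw gradient magnitude, which is one plausible reason a second-moment buffer like Adam's is not needed here.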

Entities

Institutions

  • arXiv

Sources