Scale Vectors in LLMs: Negligible Yet Critical for Training

ai-technology · 2026-05-27

A new study posted to arXiv systematically analyzes scale vectors, the learnable per-channel gains in the normalization layers of large language models (LLMs), and finds that they play an outsized role in training despite making up a negligible fraction of parameters. The research shows that removing scale vectors significantly degrades pre-training performance. In Pre-Norm architectures, scale vectors do not increase expressivity, but they improve optimization through a self-amplifying preconditioning effect on the linear mappings that follow them. The study also examines the role of weight decay for scale vectors, distinguishing between Input-Norm and Output-Norm layers. The work offers theoretical and empirical insight into a previously poorly understood component of LLMs.
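One common way to see why a scale vector adds no expressivity in front of a linear map is that it can be folded into that map's weights. The sketch below checks this numerically. It assumes an RMSNorm-style normalization, and the helper names (`rms_normalize`, `with_scale`, `folded`) are illustrative rather than taken from the study, whose own argument may be framed differently.

```python
import math
import random

def rms_normalize(x, eps=1e-6):
    """Deterministic part of an RMSNorm-style layer (assumed here)."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [v / rms for v in x]

def linear(W, x):
    """Apply a weight matrix W (list of rows) to a vector x."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]

def with_scale(W, g, x):
    # Normalize, multiply elementwise by the scale vector g, then apply W.
    normed = rms_normalize(x)
    return linear(W, [gi * ni for gi, ni in zip(g, normed)])

def folded(W, g, x):
    # Same function with g folded into the columns of W: W' = W diag(g).
    W_folded = [[w * gi for w, gi in zip(row, g)] for row in W]
    return linear(W_folded, rms_normalize(x))

random.seed(0)
d_in, d_out = 8, 4  # hypothetical toy sizes
x = [random.gauss(0, 1) for _ in range(d_in)]
g = [random.uniform(0.5, 1.5) for _ in range(d_in)]
W = [[random.gauss(0, 1) for _ in range(d_in)] for _ in range(d_out)]

# The two parameterizations agree up to floating-point rounding.
max_diff = max(abs(a - b) for a, b in zip(with_scale(W, g, x), folded(W, g, x)))
print(max_diff)
```

The two forms compute the same function, so any difference between them has to come from how gradient descent moves through each parameterization, which is where the study locates the benefit.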

Key facts

Scale vectors constitute a negligible fraction of model parameters.
Removing scale vectors substantially degrades LLM pre-training.
In Pre-Norm architectures, scale vectors do not increase expressivity.
Scale vectors improve optimization through a self-amplifying preconditioning effect.
The study distinguishes Input-Norm and Output-Norm layers for weight decay analysis.
The paper is posted on arXiv under ID 2605.26895.
The study covers expressivity, optimization, and architectural structure.
Normalization layers consist of a deterministic operation and a learnable scale vector.
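To give a rough sense of how small "negligible" is, the back-of-the-envelope sketch below counts scale-vector parameters against weight-matrix parameters in one Transformer block. Every figure is an assumption for illustration and does not come from the study: a hidden size of 4096, two normalization layers per block, attention projections totaling 4d² weights, and an ungated MLP with 4x expansion totaling 8d² weights, with biases and embeddings ignored.

```python
# Hypothetical block configuration (not taken from the study).
d = 4096                 # hidden size
norms_per_block = 2      # e.g. one norm before attention, one before the MLP

scale_params = norms_per_block * d   # one gain per channel per norm
attn_params = 4 * d * d              # Q, K, V and output projections
mlp_params = 8 * d * d               # up and down projections, 4x expansion
weight_params = attn_params + mlp_params

# Share of the block's parameters that sit in scale vectors.
fraction = scale_params / (scale_params + weight_params)
print(f"scale-vector share of block parameters: {fraction:.6%}")
```

Under these assumptions the share works out to 1 / (6d + 1), so it shrinks further as the hidden size grows.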
