ARTFEED — Contemporary Art Intelligence

Scale Vectors in LLMs: Negligible Yet Critical for Training

ai-technology · 2026-05-27

A new study posted to arXiv systematically analyzes scale vectors in large language models (LLMs), revealing their outsized role in training even though they make up a negligible fraction of parameters. The research shows that removing scale vectors substantially degrades pre-training performance. In Pre-Norm architectures, scale vectors do not increase expressivity; instead, they improve optimization through a self-amplifying preconditioning effect on the linear mappings that follow them. The study also examines how weight decay should be applied to scale vectors, distinguishing between Input-Norm and Output-Norm layers. Together, the results offer theoretical and empirical insight into a component of LLMs that was previously poorly understood.
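
To give a sense of what "negligible fraction" can mean in practice, the following is a rough back-of-the-envelope count. Every dimension in it (d_model, n_layers, d_ff, vocab_size, two normalization layers per block plus one final layer, bias-free projections, a gated three-matrix MLP, untied embeddings) is a hypothetical choice for illustration and is not taken from the study.

```python
# Hypothetical decoder-only configuration (illustrative, not from the study).
d_model = 4096       # hidden width
n_layers = 32        # number of transformer blocks
d_ff = 11008         # MLP inner width
vocab_size = 32000   # vocabulary size

# Assumption: two normalization layers per block plus one final normalization,
# each carrying one learnable scale vector of length d_model.
norms_per_block = 2
scale_params = (norms_per_block * n_layers + 1) * d_model

# Assumption: bias-free attention (Q, K, V, O projections) and a gated MLP
# with three weight matrices, plus untied input and output embeddings.
attn_params = 4 * d_model * d_model
mlp_params = 3 * d_model * d_ff
embed_params = 2 * vocab_size * d_model

other_params = n_layers * (attn_params + mlp_params) + embed_params
total_params = other_params + scale_params
fraction = scale_params / total_params

print(f"scale-vector parameters: {scale_params:,}")
print(f"total parameters:        {total_params:,}")
print(f"fraction:                {fraction:.2e}")
```

Under these assumptions the scale vectors account for a few hundred thousand parameters in a model of several billion, a share on the order of a few parts in 100,000.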

Key facts

  • Scale vectors constitute a negligible fraction of model parameters.
  • Removing scale vectors substantially degrades LLM pre-training.
  • In Pre-Norm architectures, scale vectors do not increase expressivity.
  • Scale vectors improve optimization through a self-amplifying preconditioning effect.
  • The study distinguishes Input-Norm and Output-Norm layers for weight decay analysis.
  • The research is available on arXiv under ID 2605.26895.
  • The study covers expressivity, optimization, and architectural structure.
  • Normalization layers consist of a deterministic operation and a learnable scale vector.
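
One standard way to see why a scale vector adds no expressivity when it feeds directly into a linear mapping is that the scale can be folded into that mapping's weights. The sketch below checks this numerically. It assumes an RMSNorm-style deterministic operation and a bias-free linear map; the helper names are hypothetical, and the code illustrates the general algebraic identity rather than the study's own experiments or its preconditioning analysis.

```python
import math
import random

def rms_normalize(x, eps=1e-6):
    """Deterministic part of the layer (RMSNorm-style; an assumption here)."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [v / rms for v in x]

def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]

def fold_scale_into_weights(W, g):
    """Return W' with W'[i][j] = W[i][j] * g[j], so that W' z == W (g * z)."""
    return [[w * s for w, s in zip(row, g)] for row in W]

random.seed(0)
d_in, d_out = 8, 4
x = [random.gauss(0, 1) for _ in range(d_in)]                      # input activations
g = [random.uniform(0.5, 1.5) for _ in range(d_in)]                # learnable scale vector
W = [[random.gauss(0, 1) for _ in range(d_in)] for _ in range(d_out)]  # subsequent linear map

z = rms_normalize(x)

# Path 1: normalize, apply the scale vector, then the linear map.
with_scale = matvec(W, [s * v for s, v in zip(g, z)])

# Path 2: normalize, then apply a linear map that has absorbed the scale.
folded = matvec(fold_scale_into_weights(W, g), z)

max_abs_diff = max(abs(a - b) for a, b in zip(with_scale, folded))
print(f"max abs difference between the two paths: {max_abs_diff:.2e}")
```

The two paths compute the same function up to floating-point rounding, so any benefit from keeping the scale vector as a separate parameter has to come from how training proceeds rather than from what the model can represent, which is consistent with the optimization effect the study reports.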

Entities

Institutions

  • arXiv

Sources