Scale Vectors in LLMs: Negligible Yet Critical for Training
A new study posted to arXiv systematically analyzes scale vectors, the learnable component of normalization layers in large language models (LLMs). Although scale vectors make up a negligible fraction of a model's parameters, the research shows that removing them significantly degrades pre-training performance. In Pre-Norm architectures, scale vectors do not increase expressivity; instead, they improve optimization via a self-amplifying preconditioning effect on the subsequent linear mappings. The study also examines the role of weight decay for scale vectors, distinguishing between Input-Norm and Output-Norm layers. Together, the theoretical and empirical results shed light on a previously poorly understood component of LLMs.
Key facts
- Scale vectors constitute a negligible fraction of model parameters.
- Removing scale vectors substantially degrades LLM pre-training.
- In Pre-Norm architectures, scale vectors do not increase expressivity.
- Scale vectors improve optimization through a self-amplifying preconditioning effect.
- The study distinguishes Input-Norm and Output-Norm layers for weight decay analysis.
- The paper is available on arXiv under ID 2605.26895.
- The study covers expressivity, optimization, and architectural structure.
- Normalization layers consist of a deterministic operation and a learnable scale vector.
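The expressivity point in the key facts can be illustrated with a minimal sketch. It assumes RMS normalization as the deterministic operation and an elementwise scale; the summary does not specify which normalization the study uses. All function names and numeric values here are hypothetical, chosen only for illustration. When a scale vector feeds directly into a linear mapping, it can be folded into the columns of that mapping's weight matrix, so the same function is representable without it.

```python
import math

def rms_normalize(x, eps=1e-6):
    """Deterministic part (assumed here to be RMS normalization)."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [v / rms for v in x]

def norm_layer(x, scale):
    """Normalization layer: deterministic op, then elementwise learnable scale."""
    return [g * v for g, v in zip(scale, rms_normalize(x))]

def linear(weight, x):
    """Apply a linear map given as a list of rows."""
    return [sum(w * v for w, v in zip(row, x)) for row in weight]

def absorb_scale(weight, scale):
    """Fold the scale vector into the columns of the following weight matrix."""
    return [[w * g for w, g in zip(row, scale)] for row in weight]

# Illustrative values only, not taken from the study.
x = [0.5, -1.0, 2.0]
scale = [1.5, 0.5, 2.0]
weight = [[0.2, -0.4, 0.1], [1.0, 0.3, -0.7]]

# Path 1: normalization with a scale vector, then the linear map.
with_scale = linear(weight, norm_layer(x, scale))

# Path 2: no scale vector, but the scale is folded into the weights.
without_scale = linear(absorb_scale(weight, scale), rms_normalize(x))

# The two paths agree up to floating-point rounding.
print(all(math.isclose(a, b, rel_tol=1e-9) for a, b in zip(with_scale, without_scale)))
```

The sketch only shows that both parameterizations compute the same function. It does not reproduce the optimization effect the study attributes to scale vectors, which concerns how the two parameterizations behave during training.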
Entities
Institutions
- arXiv