Heavy-Tail Guided Layerwise Learning Rates for LLMs

ai-technology · 2026-05-23

A recent study published on arXiv presents Layerwise Learning Rate (LLR), an adaptive method that assigns a separate learning rate to each Transformer layer in Large Language Models (LLMs). The approach is based on Heavy-Tailed Self-Regularization (HT-SR) theory, which characterizes the empirical spectral density of weight correlation matrices to measure how heavy-tailed each layer is. Layers with weaker heavy-tailedness are assigned larger learning rates to speed up their training, while layers with stronger heavy-tailedness receive smaller ones. According to the study, this promotes balanced training across layers, leading to faster convergence and improved performance. The authors argue that the common practice of using a uniform learning rate for all layers overlooks the structural heterogeneity of Transformers.
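
To make "heavy-tailedness" concrete, here is a minimal sketch of one way to measure it from a layer's eigenvalue spectrum. It uses the Hill estimator of a power-law exponent, where a smaller exponent means a heavier tail. The summary above does not say which metric the study uses, so the estimator, the function name hill_alpha, and the default choice of k are all assumptions made for illustration.

```python
import math


def hill_alpha(eigenvalues, k=None):
    """Hill estimate of the power-law exponent of a spectrum's upper tail.

    `eigenvalues` would typically be the eigenvalues of W^T W for a layer's
    weight matrix W (for example from numpy.linalg.eigvalsh). A smaller
    return value indicates a heavier tail.
    """
    lam = sorted(v for v in eigenvalues if v > 0)
    n = len(lam)
    if n < 2:
        raise ValueError("need at least two positive eigenvalues")
    if k is None:
        k = max(1, n // 2)  # assumption: fit the upper half of the spectrum
    k = min(k, n - 1)
    threshold = lam[n - k - 1]  # largest eigenvalue below the fitted tail
    log_sum = sum(math.log(v / threshold) for v in lam[n - k:])
    if log_sum == 0:
        raise ValueError("tail eigenvalues are all equal to the threshold")
    return 1.0 + k / log_sum


# Illustrative spectra (made up, not from the study).
heavy = [1, 2, 4, 8, 16, 32, 64, 128]              # spreads over orders of magnitude
light = [1.0, 1.1, 1.2, 1.3, 1.4, 1.5, 1.6, 1.7]   # tightly clustered
print(hill_alpha(heavy) < hill_alpha(light))       # the heavier tail gets the smaller exponent
```

Under this convention, a layer with a larger estimated exponent would count as having weaker heavy-tailedness.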

Key facts

Learning rate configuration is fundamental to modern deep learning.
Uniform learning rates across all layers overlook Transformer structural heterogeneity.
LLR assigns distinct learning rates to individual Transformer layers.
The method is grounded in Heavy-Tailed Self-Regularization (HT-SR) theory.
HT-SR characterizes the empirical spectral density of weight correlation matrices.
Layers with weaker heavy-tailedness get larger learning rates.
Layers with stronger heavy-tailedness get smaller learning rates.
LLR promotes balanced training, faster convergence, and improved performance.
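
The allocation rule in the facts above (weaker heavy-tailedness, larger learning rate) can be sketched as follows. This assumes each layer already has a heavy-tailedness score in the form of a power-law exponent, where a larger exponent means a weaker heavy tail. The linear mapping, the function name layerwise_learning_rates, and the min_scale and max_scale bounds are hypothetical choices for illustration, not the study's actual schedule.

```python
def layerwise_learning_rates(alphas, base_lr, min_scale=0.5, max_scale=1.5):
    """Map per-layer tail exponents to per-layer learning rates.

    Larger exponent (weaker heavy-tailedness) -> larger learning rate.
    The linear interpolation between min_scale and max_scale is an
    assumption, not the rule from the study.
    """
    lo, hi = min(alphas), max(alphas)
    if hi == lo:
        # All layers look alike: fall back to a uniform learning rate.
        return [base_lr for _ in alphas]
    rates = []
    for alpha in alphas:
        t = (alpha - lo) / (hi - lo)  # 0 for the heaviest tail, 1 for the lightest
        rates.append(base_lr * (min_scale + t * (max_scale - min_scale)))
    return rates


# Illustrative exponents for three layers (made up, not from the study).
rates = layerwise_learning_rates([2.0, 3.0, 4.0], base_lr=1e-3)
print(rates)  # rates increase from the first layer to the last
```

In a training loop, such rates would typically be applied by giving each layer its own optimizer parameter group with its own learning rate.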
