ARTFEED — Contemporary Art Intelligence

Attention Sinks Induce Gradient Sinks in Transformers

ai-technology · 2026-05-07

A new study on arXiv (2603.17771) investigates the relationship between attention sinks and massive activations in Transformer models from the perspective of backpropagation. The authors show that under causal masking, attention sinks concentrate gradients at the sink positions, a phenomenon they term gradient sinks. They further argue that massive activations act as adaptive regulators of this gradient pressure: because the RMSNorm Jacobian attenuates gradients inversely with the input norm, a token with massive activations damps the gradients flowing back through its normalization layer. The theory predicts that reducing sink-induced gradients should weaken massive activations, and the authors propose V-scale, a modification that adjusts backpropagated gradients, to test this prediction. The work offers a theoretical and empirical account of phenomena previously understood mainly from the forward-pass perspective.
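To make the claimed mechanism concrete, here is a minimal toy demonstration, not the paper's setup: a sink is hand-crafted by biasing the attention scores toward position 0 under a causal mask, and we then inspect where the gradients with respect to the value vectors accumulate. The bias value, dimensions, and uniform upstream gradient are all arbitrary illustration choices.

```python
import torch

torch.manual_seed(0)
T, d = 8, 16                               # toy sequence length and head dim
q, k = torch.randn(T, d), torch.randn(T, d)
v = torch.randn(T, d, requires_grad=True)

scores = (q @ k.t()) / d ** 0.5
scores[:, 0] += 4.0                        # hand-crafted sink: bias every query toward token 0
causal = torch.triu(torch.ones(T, T, dtype=torch.bool), diagonal=1)
attn = scores.masked_fill(causal, float("-inf")).softmax(dim=-1)

out = attn @ v
out.sum().backward()                       # uniform upstream gradient, for illustration only

# Under the causal mask every query row can attend to token 0, so the sink
# both receives most of the attention mass and accumulates the largest
# gradient on its value vector: a "gradient sink".
print("attention received:", attn.sum(dim=0))
print("value grad norms:  ", v.grad.norm(dim=-1))
```

Because every later position routes part of its output gradient through the sink's value vector, position 0's gradient norm dominates the rest of the sequence in this toy run.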

Key facts

  • Attention sinks and massive activations are recurring phenomena in Transformers.
  • Existing explanations focused on the forward pass.
  • The study examines the relationship from the backpropagation perspective.
  • Under causal masking, attention sinks induce gradient sinks.
  • The RMSNorm Jacobian attenuates gradients inversely with the input norm (see the sketch after this list).
  • Massive activations act as adaptive regulators of gradient pressure.
  • The proposed V-scale modification adjusts backpropagated gradients to test these predictions.
  • The paper is on arXiv with ID 2603.17771.
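The RMSNorm claim in the list above is easy to verify numerically. Below is a minimal sketch using standard RMSNorm with the gain fixed to 1 (not code from the paper): as the input norm grows, the gradient passed back through the normalization shrinks roughly in proportion, which is the attenuation the authors invoke.

```python
import torch

def rms_norm(x, eps=1e-6):
    # y = x / sqrt(mean(x^2) + eps); gain fixed to 1 for simplicity
    return x / torch.sqrt(x.pow(2).mean(dim=-1, keepdim=True) + eps)

torch.manual_seed(0)
d = 64
u = torch.randn(d)                          # fixed upstream gradient
base = torch.randn(d)

for scale in (1.0, 10.0, 100.0):            # emulate increasingly "massive" activations
    x = (base * scale).clone().requires_grad_(True)
    rms_norm(x).backward(u)
    # The RMSNorm Jacobian carries a 1/rms(x) factor, so the
    # backpropagated gradient norm falls roughly as 1/scale.
    print(f"||x|| = {x.norm():9.2f}   ||grad|| = {x.grad.norm():.6f}")
```

In this picture, a token that develops massive activations raises its own input norm and thereby damps the concentrated gradient arriving at it, which is the self-regulation the study describes.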

Entities

Institutions

  • arXiv
