ARTFEED — Contemporary Art Intelligence

Nonlinear Query Projections Improve Transformer Performance

ai-technology · 2026-04-27

An algebraic investigation has demonstrated that in both decoder-only and encoder-only transformers, the Query projection matrix W_Q can be replaced with the identity without affecting performance: attention depends on the input X only through the products XW_Q, XW_K, and XW_V, and the attention scores depend on W_Q and W_K only through the product W_Q W_K^T, so W_Q can be absorbed into the adjacent key projection. The researchers instead introduce a nonlinear residual Q(X) = X + f_θ(X), where f_θ is a bottleneck MLP containing d² + O(d) parameters. Experiments on GPT-3-small-scale models show clear gains: a 2.40% reduction in validation log-loss and a 6.81% reduction in perplexity, surpassing a baseline with 12.5% more non-embedding parameters. The identity component ties the nonlinearity to the well-understood linear map as a prior, prompting further exploration at larger scales and across modalities.
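A minimal sketch of the proposed module, assuming a PyTorch implementation; the class name NonlinearQuery, the GELU activation, and the bottleneck width d_model // 2 are illustrative assumptions, with the width chosen so that f_θ carries d² + O(d) parameters as stated above.

    import torch
    import torch.nn as nn

    class NonlinearQuery(nn.Module):
        """Nonlinear residual query map Q(X) = X + f_theta(X), replacing W_Q.

        With bottleneck width d/2 the MLP has
        d*(d/2) + d/2 + (d/2)*d + d = d^2 + O(d) parameters.
        """

        def __init__(self, d_model: int):
            super().__init__()
            hidden = d_model // 2  # bottleneck width (illustrative choice)
            self.f_theta = nn.Sequential(
                nn.Linear(d_model, hidden),
                nn.GELU(),  # assumed activation; the summary does not name one
                nn.Linear(hidden, d_model),
            )

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            # The identity skip keeps the linear baseline as a prior.
            return x + self.f_theta(x)

Inside attention, Q(X) then stands in for XW_Q while XW_K and XW_V are computed as usual; for d_model = 768 the module holds 2·768·384 + 384 + 768 parameters, i.e. 768² plus a linear term.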

Key facts

  • Query projection W_Q can be set to the identity without degrading performance.
  • Attention depends on X only through the products XW_Q, XW_K, and XW_V, and the scores depend on W_Q and W_K only through W_Q W_K^T.
  • Basis transformations can be absorbed by adjacent layers; see the sketch after this list.
  • Nonlinear residual Q(X) = X + f_θ(X) replaces W_Q.
  • f_θ is a bottleneck MLP with d² + O(d) parameters.
  • Experiments on GPT-3-small-scale models show a 2.40% reduction in validation log-loss.
  • Perplexity reduced by 6.81%.
  • Outperforms a model with 12.5% more non-embedding parameters.
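
The absorption claim in the list above is easy to check numerically. A minimal sketch, assuming unmasked single-head dot-product attention; NumPy and the toy sizes are illustrative: folding W_Q into the key projection (W_K ← W_K W_Q^T) leaves the attention scores, and hence the softmax weights, unchanged while W_Q becomes the identity.

    import numpy as np

    rng = np.random.default_rng(0)
    n, d = 5, 8  # toy sequence length and model width
    X = rng.normal(size=(n, d))
    W_Q = rng.normal(size=(d, d))
    W_K = rng.normal(size=(d, d))

    # Original scores: (X W_Q)(X W_K)^T = X (W_Q W_K^T) X^T.
    scores = (X @ W_Q) @ (X @ W_K).T

    # Absorb W_Q into the keys and set the query projection to the identity.
    W_K_folded = W_K @ W_Q.T
    scores_folded = X @ (X @ W_K_folded).T

    # The scores agree, so softmax(scores / sqrt(d)) is identical too.
    assert np.allclose(scores, scores_folded)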
