ARTFEED — Contemporary Art Intelligence

Nonlinear Query Projections Improve Transformer Performance

ai-technology · 2026-04-27

An algebraic investigation has demonstrated that in both decoder-only and encoder-only transformers, the Query projection matrix W_Q can be replaced with the identity without affecting performance: attention depends on the input X only through the products XW_Q, XW_K, and XW_V, and the attention scores depend on W_Q and W_K only through the product W_Q W_K^T, so W_Q can be absorbed into the adjacent key projection. The researchers instead introduce a nonlinear residual Q(X) = X + f_θ(X), where f_θ is a bottleneck MLP containing d² + O(d) parameters. Experiments on GPT-3-small-scale models show clear gains: a 2.40% reduction in validation log-loss and a 6.81% reduction in perplexity, surpassing a baseline with 12.5% more non-embedding parameters. The identity component ties the nonlinearity to the well-understood linear map as a prior, prompting further exploration at larger scales and across modalities.
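A minimal sketch of the proposed module, assuming a PyTorch implementation; the class name NonlinearQuery, the GELU activation, and the bottleneck width d_model // 2 are illustrative assumptions, with the width chosen so that f_θ carries d² + O(d) parameters as stated above.

    import torch
    import torch.nn as nn

    class NonlinearQuery(nn.Module):
        """Nonlinear residual query map Q(X) = X + f_theta(X), replacing W_Q.

        With bottleneck width d/2 the MLP has
        d*(d/2) + d/2 + (d/2)*d + d = d^2 + O(d) parameters.
        """

        def __init__(self, d_model: int):
            super().__init__()
            hidden = d_model // 2  # bottleneck width (illustrative choice)
            self.f_theta = nn.Sequential(
                nn.Linear(d_model, hidden),
                nn.GELU(),  # assumed activation; the summary does not name one
                nn.Linear(hidden, d_model),
            )

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            # The identity skip keeps the linear baseline as a prior.
            return x + self.f_theta(x)

Inside attention, Q(X) then stands in for XW_Q while XW_K and XW_V are computed as usual; for d_model = 768 the module holds 2·768·384 + 384 + 768 parameters, i.e. 768² plus a linear term.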

Key facts

  • Query projection W_Q can be set to the identity without degrading performance.
  • Attention depends on X only through the products XW_Q, XW_K, and XW_V, and the scores depend on W_Q and W_K only through W_Q W_K^T.
  • Basis transformations can be absorbed by adjacent layers; see the sketch after this list.
  • Nonlinear residual Q(X) = X + f_θ(X) replaces W_Q.
  • f_θ is a bottleneck MLP with d² + O(d) parameters.
  • Experiments on GPT-3-small-scale models show a 2.40% reduction in validation log-loss.
  • Perplexity reduced by 6.81%.
  • Outperforms a model with 12.5% more non-embedding parameters.
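
The absorption claim in the list above is easy to check numerically. A minimal sketch, assuming unmasked single-head dot-product attention; NumPy and the toy sizes are illustrative: folding W_Q into the key projection (W_K ← W_K W_Q^T) leaves the attention scores, and hence the softmax weights, unchanged while W_Q becomes the identity.

    import numpy as np

    rng = np.random.default_rng(0)
    n, d = 5, 8  # toy sequence length and model width
    X = rng.normal(size=(n, d))
    W_Q = rng.normal(size=(d, d))
    W_K = rng.normal(size=(d, d))

    # Original scores: (X W_Q)(X W_K)^T = X (W_Q W_K^T) X^T.
    scores = (X @ W_Q) @ (X @ W_K).T

    # Absorb W_Q into the keys and set the query projection to the identity.
    W_K_folded = W_K @ W_Q.T
    scores_folded = X @ (X @ W_K_folded).T

    # The scores agree, so softmax(scores / sqrt(d)) is identical too.
    assert np.allclose(scores, scores_folded)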
