ARTFEED — Contemporary Art Intelligence

Nexusformer Introduces Nonlinear Attention for Scalable Transformer Architecture

ai-technology · 2026-04-22

A new transformer architecture, Nexusformer, has been introduced to address the scaling limitations of conventional models. Because standard transformers rely on linear attention projections that confine feature extraction to fixed-dimensional subspaces, larger variants must be trained from scratch; this restricts both expressivity and the ability to expand capacity incrementally. Nexusformer replaces the linear Q/K/V projections with a Nexus-Rank layer, a three-stage nonlinear mapping with two activation functions that passes through progressively higher-dimensional spaces. This removes the linearity restriction and enables lossless structured growth: new capacity can be added along two axes through zero-initialized blocks that preserve pretrained knowledge. Experiments indicate that Nexusformer matches Tokenformer's perplexity while using up to 41.5% less training compute. The findings were published on arXiv under the identifier arXiv:2604.19147v1.
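The article does not pin down the Nexus-Rank layer's exact stage widths or activation function, so the following is a minimal, hypothetical sketch of the idea in PyTorch: a single projection realized as three affine stages with two activations in between, passing through progressively higher-dimensional spaces before projecting down to the head dimension. The class name NexusRankProjection, the GELU activations, and the widening factor are all illustrative assumptions, not details from the paper.

    import torch
    import torch.nn as nn

    class NexusRankProjection(nn.Module):
        """Hypothetical stand-in for a Nexus-Rank layer: three affine stages
        separated by two activations ("dual activations"), widening the
        representation before projecting to the head dimension."""

        def __init__(self, d_model: int, d_head: int, widen: int = 2):
            super().__init__()
            self.stage1 = nn.Linear(d_model, widen * d_model)                   # lift into a wider space
            self.stage2 = nn.Linear(widen * d_model, widen * widen * d_model)   # lift again
            self.stage3 = nn.Linear(widen * widen * d_model, d_head)            # project to head dim
            self.act = nn.GELU()  # activation choice is an assumption

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            return self.stage3(self.act(self.stage2(self.act(self.stage1(x)))))

    # One such module per Q, K, and V would replace the usual linear projections:
    d_model, d_head = 512, 64
    q_proj, k_proj, v_proj = (NexusRankProjection(d_model, d_head) for _ in range(3))
    x = torch.randn(2, 16, d_model)            # (batch, sequence, model dim)
    q, k, v = q_proj(x), k_proj(x), v_proj(x)  # each (2, 16, 64)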

Key facts

  • Nexusformer is a new transformer architecture designed for scalable growth
  • It replaces linear Q/K/V projections with a Nexus-Rank layer built on a three-stage nonlinear mapping
  • The architecture enables lossless structured growth through zero-initialized blocks
  • New capacity can be injected along two axes while preserving pretrained knowledge (see the sketch after this list)
  • Experiments show it matches Tokenformer's perplexity with up to 41.5% less training compute
  • Standard transformers struggle to expand without discarding learned representations
  • The primary bottleneck identified is in the attention mechanism's linear projections
  • Research was announced on arXiv with identifier arXiv:2604.19147v1
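The growth mechanism can be illustrated with a small, hypothetical sketch. Assuming that "two axes" refers to a stage's input and output widths, a trained linear stage can be embedded into a larger one whose new rows and columns start at zero, so the expanded network initially computes exactly the same function. The helper grow_linear below is an illustrative construction, not an API from the paper.

    import torch
    import torch.nn as nn

    def grow_linear(old: nn.Linear, new_in: int, new_out: int) -> nn.Linear:
        """Embed a trained linear stage in a larger one, zero-initializing
        the added rows and columns so pretrained behavior is preserved."""
        assert new_in >= old.in_features and new_out >= old.out_features
        grown = nn.Linear(new_in, new_out)
        with torch.no_grad():
            grown.weight.zero_()  # new capacity starts at zero
            grown.bias.zero_()
            grown.weight[: old.out_features, : old.in_features] = old.weight  # old weights kept verbatim
            grown.bias[: old.out_features] = old.bias
        return grown

    # The old function is reproduced exactly on the original coordinates:
    old = nn.Linear(4, 4)
    big = grow_linear(old, 6, 6)
    x = torch.randn(3, 4)
    x_pad = torch.cat([x, torch.zeros(3, 2)], dim=1)  # new inputs start inactive
    assert torch.allclose(old(x), big(x_pad)[:, :4])

Because the new output units emit zeros and the next grown stage's new input columns are likewise zero, the network's output is unchanged at the moment of growth; training then fills in the added capacity.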

Entities

Institutions

  • arXiv
