nGPT Architecture Enables Stable 4-Bit Training for LLMs
A recent arXiv paper shows that nGPT, an architecture that normalizes weights and hidden representations so they lie on the unit hypersphere, is markedly more resilient to low-precision arithmetic than standard architectures. This resilience enables stable end-to-end NVFP4 (4-bit floating-point) training without auxiliary techniques such as random Hadamard transforms or per-tensor scaling. The findings were validated on a 1.2B-parameter dense model and on hybrid Mamba-Transformer MoE models up to 3B/30B parameters. The authors trace the robustness to the behavior of dot products under quantization: quantization noise is element-wise uncorrelated in both standard and normalized architectures, but the hypersphere constraint in nGPT induces weak positive correlations among the element-wise products, so the signal accumulates constructively across the hidden dimension and 4-bit training stays efficient.
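To make the setup concrete, here is a minimal sketch of hypersphere-normalized weights and activations passing through simulated 4-bit quantization. It assumes a simplified NVFP4 model (round-to-nearest onto the FP4 E2M1 grid with one full-precision scale per 16-element block; real NVFP4 stores block scales in FP8 and training recipes add further machinery), and helper names such as `nvfp4_fake_quant` and `unit_norm` are illustrative, not from the paper:

```python
import numpy as np

# Representable magnitudes of the FP4 E2M1 format underlying NVFP4.
E2M1_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def nvfp4_fake_quant(x, block=16):
    """Simulated NVFP4: one scale per 16-element block, then
    round-to-nearest onto the signed E2M1 grid. (Real NVFP4 stores
    block scales in FP8 E4M3; full precision is kept here for clarity.)"""
    flat = x.reshape(-1, block)
    scale = np.abs(flat).max(axis=1, keepdims=True) / E2M1_GRID[-1]
    scale = np.where(scale == 0.0, 1.0, scale)        # guard all-zero blocks
    scaled = flat / scale
    idx = np.abs(np.abs(scaled)[..., None] - E2M1_GRID).argmin(axis=-1)
    q = np.sign(scaled) * E2M1_GRID[idx]              # nearest grid point
    return (q * scale).reshape(x.shape)

def unit_norm(x, axis=-1, eps=1e-12):
    """Project vectors onto the unit hypersphere along `axis`,
    mimicking nGPT's normalization of weights and hidden states."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

rng = np.random.default_rng(0)
d = 1024
h = unit_norm(rng.normal(size=(8, d)))            # hidden states on the sphere
W = unit_norm(rng.normal(size=(d, d)), axis=0)    # weight columns on the sphere

out_fp32 = h @ W
out_fp4 = nvfp4_fake_quant(h) @ nvfp4_fake_quant(W)
rel_err = np.linalg.norm(out_fp4 - out_fp32) / np.linalg.norm(out_fp32)
print(f"relative matmul error under simulated NVFP4: {rel_err:.4f}")
```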
Key facts
- arXiv:2605.06067v1
- nGPT constrains weights and hidden representations to the unit hypersphere
- Enables stable end-to-end NVFP4 training
- Validated on a 1.2B-parameter dense model and hybrid Mamba-Transformer MoE models up to 3B/30B parameters
- Removes need for random Hadamard transforms and per-tensor scaling
- Robustness traced to dot product behavior under quantization noise
- Hypersphere constraint induces weak positive correlations among element-wise products
- Constructive signal accumulation across the hidden dimension (see the sketch after this list)
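The dot-product argument can be illustrated with a toy Monte Carlo (my sketch, not the paper's experiment): the weak positive correlation among element-wise products is modeled as a small shared positive mean arising from alignment between unit vectors, i.i.d. per-element noise stands in for quantization error, and the signal-to-noise ratio of the accumulated sum is compared as the hidden dimension grows:

```python
import numpy as np

rng = np.random.default_rng(0)

def snr_of_dot(d, align, noise_std=0.05, trials=2000):
    """SNR of a noisy dot product between unit vectors whose cosine
    similarity is roughly `align`; noise is i.i.d. per element."""
    u = rng.normal(size=(trials, d))
    v = align * u + np.sqrt(1.0 - align**2) * rng.normal(size=(trials, d))
    u /= np.linalg.norm(u, axis=1, keepdims=True)
    v /= np.linalg.norm(v, axis=1, keepdims=True)
    prods = u * v                                   # element-wise products
    # Uncorrelated "quantization" noise, scaled to the typical product size.
    noise = noise_std * np.abs(prods).mean() * rng.normal(size=prods.shape)
    clean = prods.sum(axis=1)                       # exact dot product
    err = (prods + noise).sum(axis=1) - clean       # accumulated noise
    return np.abs(clean).mean() / err.std()

for d in (256, 1024, 4096):
    print(f"d={d:5d}  aligned (nGPT-like) SNR={snr_of_dot(d, 0.2):7.1f}  "
          f"uncorrelated SNR={snr_of_dot(d, 0.0):6.1f}")
```

In this toy setup the aligned (nGPT-like) SNR grows roughly with the square root of the hidden dimension, while the uncorrelated baseline stays flat, mirroring the constructive-accumulation effect described above.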