nGPT Architecture Enables Stable 4-Bit Training for LLMs
A recent arXiv paper shows that nGPT, an architecture that normalizes weights and hidden representations so they lie on the unit hypersphere, is markedly more resilient to low-precision arithmetic than standard architectures. This resilience enables stable end-to-end NVFP4 (4-bit floating-point) training without auxiliary techniques such as random Hadamard transforms or per-tensor scaling. The findings were validated on a 1.2B-parameter dense model and on hybrid Mamba-Transformer MoE models up to 3B/30B parameters. The authors trace the robustness to the behavior of dot products under quantization: quantization noise is element-wise uncorrelated in both standard and normalized architectures, but the hypersphere constraint in nGPT induces weak positive correlations among the element-wise products, so the signal accumulates constructively across the hidden dimension and 4-bit training stays efficient.
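To make the setup concrete, here is a minimal sketch of hypersphere-normalized weights and activations passing through simulated 4-bit quantization. It assumes a simplified NVFP4 model (round-to-nearest onto the FP4 E2M1 grid with one full-precision scale per 16-element block; real NVFP4 stores block scales in FP8 and training recipes add further machinery), and helper names such as `nvfp4_fake_quant` and `unit_norm` are illustrative, not from the paper:

```python
import numpy as np

# Representable magnitudes of the FP4 E2M1 format underlying NVFP4.
E2M1_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def nvfp4_fake_quant(x, block=16):
    """Simulated NVFP4: one scale per 16-element block, then
    round-to-nearest onto the signed E2M1 grid. (Real NVFP4 stores
    block scales in FP8 E4M3; full precision is kept here for clarity.)"""
    flat = x.reshape(-1, block)
    scale = np.abs(flat).max(axis=1, keepdims=True) / E2M1_GRID[-1]
    scale = np.where(scale == 0.0, 1.0, scale)        # guard all-zero blocks
    scaled = flat / scale
    idx = np.abs(np.abs(scaled)[..., None] - E2M1_GRID).argmin(axis=-1)
    q = np.sign(scaled) * E2M1_GRID[idx]              # nearest grid point
    return (q * scale).reshape(x.shape)

def unit_norm(x, axis=-1, eps=1e-12):
    """Project vectors onto the unit hypersphere along `axis`,
    mimicking nGPT's normalization of weights and hidden states."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

rng = np.random.default_rng(0)
d = 1024
h = unit_norm(rng.normal(size=(8, d)))            # hidden states on the sphere
W = unit_norm(rng.normal(size=(d, d)), axis=0)    # weight columns on the sphere

out_fp32 = h @ W
out_fp4 = nvfp4_fake_quant(h) @ nvfp4_fake_quant(W)
rel_err = np.linalg.norm(out_fp4 - out_fp32) / np.linalg.norm(out_fp32)
print(f"relative matmul error under simulated NVFP4: {rel_err:.4f}")
```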
Key facts
- arXiv:2605.06067v1
- nGPT constrains weights and hidden representations to the unit hypersphere
- Enables stable end-to-end NVFP4 training
- Validated on a 1.2B-parameter dense model and hybrid Mamba-Transformer MoE models up to 3B/30B parameters
- Removes need for random Hadamard transforms and per-tensor scaling
- Robustness traced to dot product behavior under quantization noise
- Hypersphere constraint induces weak positive correlations among element-wise products
- Constructive signal accumulation across the hidden dimension (see the sketch after this list)
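The dot-product argument can be illustrated with a toy Monte Carlo (my sketch, not the paper's experiment): the weak positive correlation among element-wise products is modeled as a small shared positive mean arising from alignment between unit vectors, i.i.d. per-element noise stands in for quantization error, and the signal-to-noise ratio of the accumulated sum is compared as the hidden dimension grows:

```python
import numpy as np

rng = np.random.default_rng(0)

def snr_of_dot(d, align, noise_std=0.05, trials=2000):
    """SNR of a noisy dot product between unit vectors whose cosine
    similarity is roughly `align`; noise is i.i.d. per element."""
    u = rng.normal(size=(trials, d))
    v = align * u + np.sqrt(1.0 - align**2) * rng.normal(size=(trials, d))
    u /= np.linalg.norm(u, axis=1, keepdims=True)
    v /= np.linalg.norm(v, axis=1, keepdims=True)
    prods = u * v                                   # element-wise products
    # Uncorrelated "quantization" noise, scaled to the typical product size.
    noise = noise_std * np.abs(prods).mean() * rng.normal(size=prods.shape)
    clean = prods.sum(axis=1)                       # exact dot product
    err = (prods + noise).sum(axis=1) - clean       # accumulated noise
    return np.abs(clean).mean() / err.std()

for d in (256, 1024, 4096):
    print(f"d={d:5d}  aligned (nGPT-like) SNR={snr_of_dot(d, 0.2):7.1f}  "
          f"uncorrelated SNR={snr_of_dot(d, 0.0):6.1f}")
```

In this toy setup the aligned (nGPT-like) SNR grows roughly with the square root of the hidden dimension, while the uncorrelated baseline stays flat, mirroring the constructive-accumulation effect described above.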