NVIDIA Nemotron-Labs Diffusion Models Enable Speed-of-Light Text Generation
NVIDIA has released Nemotron-Labs Diffusion, a family of diffusion language models (DLMs) that generate multiple tokens in parallel and iteratively refine them, offering up to 6.4× faster token generation than autoregressive (AR) models. The models come in 3B, 8B, and 14B scales under the NVIDIA Nemotron Open Model License, plus an 8B vision-language model under the NVIDIA Source Code License. The release includes base and instruction-tuned chat variants, training code via the NVIDIA Megatron Bridge framework, and a technical report.

The models support three inference modes: autoregressive, diffusion, and self-speculation. Self-speculation combines drafting and verification for lossless acceleration. The 8B model achieves 1.2% higher average accuracy than Qwen3 8B, with diffusion mode reaching 2.6× tokens per forward pass and self-speculation up to 6.4×. On B200 hardware, self-speculation achieves ~865 tok/s on speedbench, roughly 4× the autoregressive baseline. Deployment via SGLang allows switching modes with a single config line.

Training used 1.3T tokens from NVIDIA Nemotron Pretraining datasets and 45B tokens from post-training datasets, building on the Efficient-DLM approach, which converts pretrained AR models into DLMs.
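The draft-and-verify idea behind self-speculation can be illustrated with a toy sketch. This is not NVIDIA's implementation: `ar_next`, `draft_block`, `verify`, and `self_speculate` are hypothetical names, the "model" is a deterministic arithmetic function, and the drafter is a deliberately imperfect copy of it. The sketch only shows why accepting the longest verified prefix of a parallel draft is lossless with respect to greedy autoregressive decoding, and why it can yield more than one token per verification pass.

```python
from typing import List, Tuple

Token = int
VOCAB = 50  # toy vocabulary size (assumption for illustration)


def ar_next(prefix: List[Token]) -> Token:
    """Toy stand-in for greedy autoregressive decoding of one token."""
    return (sum(prefix) * 31 + len(prefix) * 7) % VOCAB


def draft_block(prefix: List[Token], k: int) -> List[Token]:
    """Toy stand-in for a parallel draft of k tokens.

    It mostly agrees with ar_next but is wrong at some positions,
    mimicking a fast drafter that is not always accurate.
    """
    ctx = list(prefix)
    block: List[Token] = []
    for _ in range(k):
        tok = ar_next(ctx)
        if len(ctx) % 5 == 4:  # deliberately injected draft error
            tok = (tok + 1) % VOCAB
        block.append(tok)
        ctx.append(tok)
    return block


def verify(prefix: List[Token], block: List[Token]) -> List[Token]:
    """Accept the longest draft prefix that matches the AR prediction.

    A real model would score every drafted position in one forward pass;
    this loop computes the same per-position targets one at a time.
    """
    ctx = list(prefix)
    accepted: List[Token] = []
    for tok in block:
        target = ar_next(ctx)
        if tok != target:
            accepted.append(target)  # replace the first wrong token, then stop
            return accepted
        accepted.append(tok)
        ctx.append(tok)
    accepted.append(ar_next(ctx))  # whole draft accepted: one extra token
    return accepted


def self_speculate(prompt: List[Token], n_new: int, k: int = 4) -> Tuple[List[Token], int]:
    """Generate n_new tokens; return the output and the number of verify passes."""
    out = list(prompt)
    passes = 0
    while len(out) - len(prompt) < n_new:
        out.extend(verify(out, draft_block(out, k)))
        passes += 1
    return out[: len(prompt) + n_new], passes


def ar_decode(prompt: List[Token], n_new: int) -> List[Token]:
    """Plain greedy decoding, one token per pass, used as the reference."""
    out = list(prompt)
    for _ in range(n_new):
        out.append(ar_next(out))
    return out


if __name__ == "__main__":
    tokens, passes = self_speculate([1, 2, 3], n_new=20)
    print("identical to AR output:", tokens == ar_decode([1, 2, 3], 20))
    print("tokens per verification pass:", 20 / passes)
```

Because every emitted token is either a verified draft token or the verifier's own prediction, the output matches plain greedy decoding exactly. The speedup in this toy comes only from how many draft tokens survive verification per pass.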
Key facts
- Nemotron-Labs Diffusion generates multiple tokens in parallel and iteratively refines them.
- Models available at 3B, 8B, and 14B scales under NVIDIA Nemotron Open Model License.
- 8B vision-language model available under NVIDIA Source Code License.
- Supports autoregressive, diffusion, and self-speculation inference modes.
- 8B model achieves 1.2% higher average accuracy than Qwen3 8B.
- Diffusion mode reaches 2.6× tokens per forward pass; self-speculation up to 6.4×.
- Self-speculation on B200 achieves ~865 tok/s on speedbench, ~4× AR baseline.
- Trained on 1.3T tokens from NVIDIA Nemotron Pretraining datasets and 45B tokens from post-training datasets.
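A few derived quantities follow from the figures above by simple arithmetic. The short sketch below computes them. The variable names are illustrative, and the implied autoregressive baseline is only as precise as the "roughly 4×" figure it is derived from.

```python
# Figures quoted in the key facts above.
SELF_SPEC_TOK_PER_S = 865.0          # ~865 tok/s, self-speculation on B200
SPEEDUP_OVER_AR = 4.0                # "roughly 4x" the AR baseline (approximate)
PRETRAIN_TOKENS = 1_300_000_000_000  # 1.3T pretraining tokens
POSTTRAIN_TOKENS = 45_000_000_000    # 45B post-training tokens

# Implied AR throughput on the same benchmark (rough, since 4x is approximate).
implied_ar_tok_per_s = SELF_SPEC_TOK_PER_S / SPEEDUP_OVER_AR

# Total training tokens and the share contributed by post-training.
total_tokens = PRETRAIN_TOKENS + POSTTRAIN_TOKENS
posttrain_share = POSTTRAIN_TOKENS / total_tokens

if __name__ == "__main__":
    print(f"implied AR baseline: ~{implied_ar_tok_per_s:.0f} tok/s")
    print(f"total training tokens: {total_tokens / 1e12:.3f}T")
    print(f"post-training share: {posttrain_share:.1%}")
```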
Entities
Institutions
- NVIDIA
- Hugging Face
- GitHub
- SGLang