Bell-Shaped Time Sampling Accelerates Masked Diffusion Language Models
A recent study published on arXiv (2605.13026) identifies the locality bias of language, the tendency for the information needed to predict a token to lie in nearby context, as the main factor slowing the training of masked diffusion models (MDMs) for language modeling. To address this, the researchers propose a simple remedy, bell-shaped time sampling, which accelerates MDM training by up to 4× on the One Billion Word Benchmark (LM1B) without compromising final performance. This addresses a significant drawback of MDMs relative to autoregressive models (ARMs).
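The paper's exact sampling distribution is not reproduced here. As an illustration, a symmetric Beta(α, α) with α > 1 is one common way to obtain a bell-shaped density over diffusion times t ∈ (0, 1), concentrating training on intermediate masking rates instead of drawing t uniformly. A minimal NumPy sketch; `alpha` and the helper names are assumptions, not the authors' code:

```python
import numpy as np

def sample_mask_times(batch_size, alpha=3.0, rng=None):
    """Bell-shaped time sampling (sketch): draw diffusion times t in (0, 1)
    from Beta(alpha, alpha), which peaks at t = 0.5 when alpha > 1,
    rather than from the usual Uniform(0, 1)."""
    rng = np.random.default_rng() if rng is None else rng
    return rng.beta(alpha, alpha, size=batch_size)

def mask_tokens(token_ids, t, mask_id, rng=None):
    """Standard MDM forward corruption: mask each token independently
    with probability t (one sampled time per sequence)."""
    rng = np.random.default_rng() if rng is None else rng
    drop = rng.random(token_ids.shape) < t[:, None]  # broadcast t per row
    return np.where(drop, mask_id, token_ids)
```

During training, the loss is computed only on the masked positions; a bell-shaped density changes nothing else, it simply reweights how often each masking level t is visited.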
Key facts
- Masked diffusion models (MDMs) are a promising alternative to autoregressive models (ARMs) for language modeling.
- MDMs learn substantially more slowly than ARMs.
- The main factor slowing MDM training is the locality bias of language.
- Bell-shaped time sampling is proposed as a training strategy.
- MDMs with the new recipe reach the same validation NLL up to ~4× faster on LM1B.
- The study is published on arXiv with ID 2605.13026.