Bell-Shaped Time Sampling Accelerates Masked Diffusion Language Models
A recent study published on arXiv (2605.13026) identifies the locality bias of language, the tendency for the information needed to predict a token to lie in nearby context, as the main factor slowing the training of masked diffusion models (MDMs) for language modeling. To address this, the researchers propose a simple remedy, bell-shaped time sampling, which accelerates MDM training by up to 4× on the One Billion Word Benchmark (LM1B) without compromising final performance. This addresses a significant drawback of MDMs relative to autoregressive models (ARMs).
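The paper's exact sampling distribution is not reproduced here. As an illustration, a symmetric Beta(α, α) with α > 1 is one common way to obtain a bell-shaped density over diffusion times t ∈ (0, 1), concentrating training on intermediate masking rates instead of drawing t uniformly. A minimal NumPy sketch; `alpha` and the helper names are assumptions, not the authors' code:

```python
import numpy as np

def sample_mask_times(batch_size, alpha=3.0, rng=None):
    """Bell-shaped time sampling (sketch): draw diffusion times t in (0, 1)
    from Beta(alpha, alpha), which peaks at t = 0.5 when alpha > 1,
    rather than from the usual Uniform(0, 1)."""
    rng = np.random.default_rng() if rng is None else rng
    return rng.beta(alpha, alpha, size=batch_size)

def mask_tokens(token_ids, t, mask_id, rng=None):
    """Standard MDM forward corruption: mask each token independently
    with probability t (one sampled time per sequence)."""
    rng = np.random.default_rng() if rng is None else rng
    drop = rng.random(token_ids.shape) < t[:, None]  # broadcast t per row
    return np.where(drop, mask_id, token_ids)
```

During training, the loss is computed only on the masked positions; a bell-shaped density changes nothing else, it simply reweights how often each masking level t is visited.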
Key facts
- Masked diffusion models (MDMs) are a promising alternative to autoregressive models (ARMs) for language modeling.
- MDMs learn substantially more slowly than ARMs.
- The main factor slowing MDM training is the locality bias of language.
- Bell-shaped time sampling is proposed as a training strategy.
- MDMs with the new recipe reach the same validation NLL up to ~4× faster on LM1B.
- The study is published on arXiv with ID 2605.13026.