Blockwise Locality Improves Masked Diffusion Language Model Training
A new study on arXiv (2604.24832) compares how masked diffusion language models (MDMs) train relative to autoregressive large language models (AR-LLMs). The authors find that MDMs trained with standard random masking struggle on linear regression and show high variance on graph path-finding, yet outperform AR-LLMs on Sudoku. To address these weaknesses, they propose two models, Jigsaw and Scatter, which impose blockwise locality: generation follows a left-to-right order within blocks while still allowing iterative refinement. Notably, Jigsaw matches AR-LLM stability on linear regression while remaining strong on Sudoku. The full paper is available on arXiv.
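To make the contrast concrete, here is a minimal, hypothetical PyTorch sketch of the two corruption schemes: standard random masking versus a blockwise-local variant that keeps the prefix before an active block visible. The `MASK_ID` value, the per-example masking rate, and the decision to hide the suffix are illustrative assumptions, not the paper's exact Jigsaw/Scatter recipe.

```python
import torch

MASK_ID = 0  # hypothetical id of the [MASK] token (assumption)

def random_masking(tokens: torch.Tensor, mask_prob: float) -> torch.Tensor:
    """Standard MDM corruption: every position is masked independently."""
    masked = tokens.clone()
    masked[torch.rand(tokens.shape) < mask_prob] = MASK_ID
    return masked

def blockwise_masking(tokens: torch.Tensor, block_size: int) -> torch.Tensor:
    """Blockwise-local corruption (assumed scheme): pick one active block,
    keep the prefix before it fully visible, partially mask the block, and
    hide the suffix, so predictions condition on a clean left context."""
    seq_len = tokens.numel()                   # tokens: 1-D long tensor
    n_blocks = (seq_len + block_size - 1) // block_size
    b = int(torch.randint(n_blocks, (1,)))     # index of the active block
    start, end = b * block_size, min((b + 1) * block_size, seq_len)
    masked = tokens.clone()
    rate = float(torch.rand(1))                # per-example masking rate
    masked[start:end][torch.rand(end - start) < rate] = MASK_ID
    masked[end:] = MASK_ID                     # suffix not yet generated
    return masked
```

Under the blockwise variant, every prediction inside the active block conditions on a fully observed left context, which is one way to read the "blockwise locality" of the title.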
Key facts
- arXiv paper 2604.24832 studies masked diffusion language models (MDMs).
- Standard random-masking MDMs fail on linear regression.
- MDMs show high variance on graph path-finding.
- MDMs outperform AR-LLMs on Sudoku.
- Two new models proposed: Jigsaw and Scatter.
- Jigsaw and Scatter combine blockwise locality with autoregressive ordering within blocks (see the decoding sketch after this list).
- Jigsaw matches AR-LLM stability on linear regression.
- Jigsaw remains strong on Sudoku.
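As a rough illustration of how such a model might decode, the sketch below fills each block strictly left to right, then runs re-mask-and-refill refinement sweeps. The `model` call signature, `MASK_ID`, and the confidence-based re-masking heuristic are assumptions; the paper's actual Jigsaw and Scatter procedures may differ.

```python
import torch

MASK_ID = 0  # hypothetical [MASK] id, matching the training sketch above

@torch.no_grad()
def blockwise_decode(model, seq_len, block_size, refine_sweeps=1, remask_k=2):
    """Decode with blockwise locality plus iterative refinement (assumed
    scheme). `model` is assumed to map (batch, seq) token ids to
    (batch, seq, vocab) logits."""
    seq = torch.full((seq_len,), MASK_ID, dtype=torch.long)

    def fill(start, end):
        # Fill the block strictly left to right (the blockwise locality).
        for pos in range(start, end):
            if seq[pos] == MASK_ID:
                logits = model(seq.unsqueeze(0))[0]  # (seq_len, vocab)
                seq[pos] = logits[pos].argmax(-1)

    starts = list(range(0, seq_len, block_size))
    for start in starts:  # a Scatter-style variant might permute these
        fill(start, min(start + block_size, seq_len))

    # Iterative refinement: re-mask each block's least confident tokens
    # and refill them, again left to right.
    for _ in range(refine_sweeps):
        for start in starts:
            end = min(start + block_size, seq_len)
            logits = model(seq.unsqueeze(0))[0]
            conf = logits[start:end].softmax(-1).gather(
                -1, seq[start:end].unsqueeze(-1)).squeeze(-1)
            k = min(remask_k, end - start)
            seq[start + conf.topk(k, largest=False).indices] = MASK_ID
            fill(start, end)
    return seq
```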