Blockwise Locality Improves Masked Diffusion Language Model Training
A new study on arXiv (2604.24832) compares how masked diffusion language models (MDMs) train relative to autoregressive large language models (AR-LLMs). The authors find that MDMs trained with standard random masking struggle on linear regression and show high variance on graph path-finding, yet outperform AR-LLMs on Sudoku. To address these weaknesses, they propose two models, Jigsaw and Scatter, which impose blockwise locality: generation follows a left-to-right order within blocks while still allowing iterative refinement. Notably, Jigsaw matches AR-LLM stability on linear regression while remaining strong on Sudoku. The full paper is available on arXiv.
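To make the contrast concrete, here is a minimal, hypothetical PyTorch sketch of the two corruption schemes: standard random masking versus a blockwise-local variant that keeps the prefix before an active block visible. The `MASK_ID` value, the per-example masking rate, and the decision to hide the suffix are illustrative assumptions, not the paper's exact Jigsaw/Scatter recipe.

```python
import torch

MASK_ID = 0  # hypothetical id of the [MASK] token (assumption)

def random_masking(tokens: torch.Tensor, mask_prob: float) -> torch.Tensor:
    """Standard MDM corruption: every position is masked independently."""
    masked = tokens.clone()
    masked[torch.rand(tokens.shape) < mask_prob] = MASK_ID
    return masked

def blockwise_masking(tokens: torch.Tensor, block_size: int) -> torch.Tensor:
    """Blockwise-local corruption (assumed scheme): pick one active block,
    keep the prefix before it fully visible, partially mask the block, and
    hide the suffix, so predictions condition on a clean left context."""
    seq_len = tokens.numel()                   # tokens: 1-D long tensor
    n_blocks = (seq_len + block_size - 1) // block_size
    b = int(torch.randint(n_blocks, (1,)))     # index of the active block
    start, end = b * block_size, min((b + 1) * block_size, seq_len)
    masked = tokens.clone()
    rate = float(torch.rand(1))                # per-example masking rate
    masked[start:end][torch.rand(end - start) < rate] = MASK_ID
    masked[end:] = MASK_ID                     # suffix not yet generated
    return masked
```

Under the blockwise variant, every prediction inside the active block conditions on a fully observed left context, which is one way to read the "blockwise locality" of the title.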
Key facts
- arXiv paper 2604.24832 studies masked diffusion language models (MDMs).
- Standard random-masking MDMs fail on linear regression.
- MDMs show high variance on graph path-finding.
- MDMs outperform AR-LLMs on Sudoku.
- Two new models proposed: Jigsaw and Scatter.
- Jigsaw and Scatter combine blockwise locality with autoregressive ordering within blocks (see the decoding sketch after this list).
- Jigsaw matches AR-LLM stability on linear regression.
- Jigsaw remains strong on Sudoku.
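As a rough illustration of how such a model might decode, the sketch below fills each block strictly left to right, then runs re-mask-and-refill refinement sweeps. The `model` call signature, `MASK_ID`, and the confidence-based re-masking heuristic are assumptions; the paper's actual Jigsaw and Scatter procedures may differ.

```python
import torch

MASK_ID = 0  # hypothetical [MASK] id, matching the training sketch above

@torch.no_grad()
def blockwise_decode(model, seq_len, block_size, refine_sweeps=1, remask_k=2):
    """Decode with blockwise locality plus iterative refinement (assumed
    scheme). `model` is assumed to map (batch, seq) token ids to
    (batch, seq, vocab) logits."""
    seq = torch.full((seq_len,), MASK_ID, dtype=torch.long)

    def fill(start, end):
        # Fill the block strictly left to right (the blockwise locality).
        for pos in range(start, end):
            if seq[pos] == MASK_ID:
                logits = model(seq.unsqueeze(0))[0]  # (seq_len, vocab)
                seq[pos] = logits[pos].argmax(-1)

    starts = list(range(0, seq_len, block_size))
    for start in starts:  # a Scatter-style variant might permute these
        fill(start, min(start + block_size, seq_len))

    # Iterative refinement: re-mask each block's least confident tokens
    # and refill them, again left to right.
    for _ in range(refine_sweeps):
        for start in starts:
            end = min(start + block_size, seq_len)
            logits = model(seq.unsqueeze(0))[0]
            conf = logits[start:end].softmax(-1).gather(
                -1, seq[start:end].unsqueeze(-1)).squeeze(-1)
            k = min(remask_k, end - start)
            seq[start + conf.topk(k, largest=False).indices] = MASK_ID
            fill(start, end)
    return seq
```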