Predict-then-Diffuse: Adaptive Response Length for Compute-Budgeted Inference in Diffusion LLMs
A new framework called Predict-then-Diffuse addresses the fixed response-length requirement of Diffusion-based Large Language Models (D-LLMs). These models generate tokens in parallel, offering throughput advantages over autoregressive models, but must commit to a response length before generation begins: an oversized allocation wastes computation on padding tokens, while an undersized one forces truncation and costly re-computation. The proposed method estimates an appropriate response length for each query before inference, enabling compute-budgeted generation. It is model-agnostic and aims to improve GPU utilization and reduce latency spikes. The paper is available on arXiv with ID 2605.04215.
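A minimal sketch of how such a pipeline could look, assuming a hypothetical length predictor. The names below (`estimate_length`, `diffusion_generate`, `max_budget`) and the placeholder heuristic are illustrative assumptions, not the paper's actual API or method:

```python
def estimate_length(query: str) -> int:
    """Hypothetical per-query length predictor (assumption: the paper's
    actual predictor could be a learned model rather than a heuristic)."""
    # Placeholder heuristic: scale with query size, within sane bounds.
    return min(max(32, 4 * len(query.split())), 1024)

def diffusion_generate(query: str, response_length: int) -> str:
    """Stand-in stub for any D-LLM's parallel denoising call
    (the framework is model-agnostic, so this is interchangeable)."""
    return f"<response decoded from {response_length} parallel token slots>"

def predict_then_diffuse(query: str, max_budget: int = 512) -> str:
    # 1. Predict an appropriate response length for this query.
    predicted = estimate_length(query)
    # 2. Clamp to the compute budget so the allocation never exceeds it.
    length = min(predicted, max_budget)
    # 3. Run diffusion generation with the chosen fixed length.
    return diffusion_generate(query, response_length=length)

if __name__ == "__main__":
    print(predict_then_diffuse("Explain diffusion LLM decoding briefly."))
```

Clamping the predicted length to a budget is what makes the inference compute-budgeted: the allocation, and hence the per-step cost, is bounded up front.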
Key facts
- Diffusion-based LLMs (D-LLMs) generate tokens in parallel, unlike autoregressive models.
- D-LLMs require a fixed response length before generation.
- Oversized response length wastes computation on padding tokens (a worked example follows this list).
- Undersized response length causes truncation and costly re-computations.
- Predict-then-Diffuse estimates response length per input query.
- The framework is model-agnostic and enables compute-budgeted inference.
- The paper is published on arXiv with ID 2605.04215.
- The method aims to improve GPU utilization and reduce latency spikes.
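To make the oversizing cost concrete, here is a back-of-the-envelope sketch. All numbers are assumed for illustration and are not figures from the paper; the point is that every denoising step operates over the full pre-allocated sequence, so padding positions consume compute in proportion to their share of the allocation:

```python
# Illustrative arithmetic with assumed numbers (not from the paper):
# a D-LLM pre-allocates 512 response positions, but the answer only
# needs 120 tokens. The remaining positions are padding, yet they
# still participate in every parallel denoising step.
allocated = 512   # fixed response length chosen before generation (assumed)
needed = 120      # tokens the answer actually uses (assumed)

wasted_fraction = (allocated - needed) / allocated
print(f"padding overhead: {wasted_fraction:.0%}")  # -> padding overhead: 77%
```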
Entities
Institutions
- arXiv