Predict-then-Diffuse: Adaptive Response Length for Compute-Budgeted Inference in Diffusion LLMs
A new framework called Predict-then-Diffuse addresses the fixed response-length requirement of Diffusion-based Large Language Models (D-LLMs). These models generate tokens in parallel, offering throughput advantages over autoregressive models, but must commit to a response length before generation begins: an oversized allocation wastes computation on padding tokens, while an undersized one forces truncation and costly re-computation. The proposed method estimates an appropriate response length for each query before inference, enabling compute-budgeted generation. It is model-agnostic and aims to improve GPU utilization and reduce latency spikes. The paper is available on arXiv with ID 2605.04215.
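A minimal sketch of how such a pipeline could look, assuming a hypothetical length predictor. The names below (`estimate_length`, `diffusion_generate`, `max_budget`) and the placeholder heuristic are illustrative assumptions, not the paper's actual API or method:

```python
def estimate_length(query: str) -> int:
    """Hypothetical per-query length predictor (assumption: the paper's
    actual predictor could be a learned model rather than a heuristic)."""
    # Placeholder heuristic: scale with query size, within sane bounds.
    return min(max(32, 4 * len(query.split())), 1024)

def diffusion_generate(query: str, response_length: int) -> str:
    """Stand-in stub for any D-LLM's parallel denoising call
    (the framework is model-agnostic, so this is interchangeable)."""
    return f"<response decoded from {response_length} parallel token slots>"

def predict_then_diffuse(query: str, max_budget: int = 512) -> str:
    # 1. Predict an appropriate response length for this query.
    predicted = estimate_length(query)
    # 2. Clamp to the compute budget so the allocation never exceeds it.
    length = min(predicted, max_budget)
    # 3. Run diffusion generation with the chosen fixed length.
    return diffusion_generate(query, response_length=length)

if __name__ == "__main__":
    print(predict_then_diffuse("Explain diffusion LLM decoding briefly."))
```

Clamping the predicted length to a budget is what makes the inference compute-budgeted: the allocation, and hence the per-step cost, is bounded up front.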
Key facts
- Diffusion-based LLMs (D-LLMs) generate tokens in parallel, unlike autoregressive models.
- D-LLMs require a fixed response length before generation.
- Oversized response length wastes computation on padding tokens (a worked example follows this list).
- Undersized response length causes truncation and costly re-computations.
- Predict-then-Diffuse estimates response length per input query.
- The framework is model-agnostic and enables compute-budgeted inference.
- The paper is published on arXiv with ID 2605.04215.
- The method aims to improve GPU utilization and reduce latency spikes.
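To make the oversizing cost concrete, here is a back-of-the-envelope sketch. All numbers are assumed for illustration and are not figures from the paper; the point is that every denoising step operates over the full pre-allocated sequence, so padding positions consume compute in proportion to their share of the allocation:

```python
# Illustrative arithmetic with assumed numbers (not from the paper):
# a D-LLM pre-allocates 512 response positions, but the answer only
# needs 120 tokens. The remaining positions are padding, yet they
# still participate in every parallel denoising step.
allocated = 512   # fixed response length chosen before generation (assumed)
needed = 120      # tokens the answer actually uses (assumed)

wasted_fraction = (allocated - needed) / allocated
print(f"padding overhead: {wasted_fraction:.0%}")  # -> padding overhead: 77%
```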
Entities
Institutions
- arXiv