ARTFEED — Contemporary Art Intelligence

Predict-then-Diffuse: Adaptive Response Length for Compute-Budgeted Inference in Diffusion LLMs

ai-technology · 2026-05-07

A new framework called Predict-then-Diffuse addresses a core limitation of diffusion-based large language models (D-LLMs): the response length must be fixed before generation begins. D-LLMs generate tokens in parallel, offering throughput advantages over autoregressive models, but the predetermined length creates a trade-off. If it is oversized, computation is wasted on padding tokens; if it is undersized, the output is truncated and must be regenerated at extra cost. The proposed method estimates an appropriate response length for each query before inference, enabling compute-budgeted generation. The framework is model-agnostic and aims to improve GPU utilization and reduce latency spikes. The paper is available on arXiv with ID 2605.04215.

Key facts

  • Diffusion-based LLMs (D-LLMs) generate tokens in parallel, unlike autoregressive models.
  • D-LLMs require a fixed response length before generation.
  • Oversized response length wastes computation on padding tokens.
  • Undersized response length causes truncation and costly re-computations.
  • Predict-then-Diffuse estimates response length per input query.
  • The framework is model-agnostic and enables compute-budgeted inference.
  • The paper is published on arXiv with ID 2605.04215.
  • The method aims to improve GPU utilization and reduce latency spikes.
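The padding/truncation trade-off above can be made concrete with a toy cost model. The sketch below is purely illustrative (the cost functions, the doubling re-run policy, and all numbers are assumptions, not from the paper): it counts per-token denoising work under a one-size-fits-all length budget versus a per-query predicted budget.

```python
def denoise_cost(length: int, steps: int) -> int:
    """A D-LLM denoises every position at every refinement step,
    so work scales with response length x number of steps."""
    return length * steps

def fixed_budget_cost(fixed_len: int, true_len: int, steps: int) -> int:
    """Fixed budget: oversized lengths pay for padding tokens;
    undersized lengths pay for a truncated first pass plus a
    re-run at a doubled budget (an assumed retry policy)."""
    if true_len <= fixed_len:
        return denoise_cost(fixed_len, steps)  # padding waste included
    return denoise_cost(fixed_len, steps) + fixed_budget_cost(
        2 * fixed_len, true_len, steps)        # costly re-computation

def predicted_budget_cost(predicted_len: int, true_len: int, steps: int) -> int:
    """Per-query budget: with a length estimate close to the actual
    response size, the allocated compute tracks the real need."""
    budget = max(predicted_len, true_len)  # assume the predictor is adequate
    return denoise_cost(budget, steps)

# A 100-token answer under a blanket 1024-token budget:
print(fixed_budget_cost(1024, 100, steps=8))     # 8192 token-steps
# The same answer with a per-query estimate of 110 tokens:
print(predicted_budget_cost(110, 100, steps=8))  # 880 token-steps
```

Under these toy assumptions the predicted budget does roughly a tenth of the work for a short answer, which is the kind of GPU-utilization gain the framework targets.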

Entities

Institutions

  • arXiv

Sources