ARTFEED — Contemporary Art Intelligence

DARE: Boosting Diffusion LLM Inference via Activation Reuse

ai-technology · 2026-05-12

Researchers have introduced DARE (Diffusion Language Model Activation Reuse), a method that accelerates inference in diffusion large language models (dLLMs) by exploiting token-wise redundancy in bi-directional self-attention. The approach comprises two mechanisms: DARE-KV reuses cached key-value activations, while DARE-O reuses attention output activations, cutting redundant computation with negligible quality loss. Experiments report up to a 1.20x per-layer latency reduction while reusing up to 87% of attention activations. The work addresses a key bottleneck of open-source dLLMs, which remain less mature than auto-regressive models, and points toward faster parallel generation. The paper is available on arXiv under identifier 2605.08134.

Key facts

  • DARE targets diffusion large language models (dLLMs).
  • It exploits token-wise redundancy in bi-directional self-attention.
  • Two mechanisms: DARE-KV and DARE-O.
  • DARE-KV reuses cached key-value activations.
  • DARE-O reuses output activations.
  • Achieves up to 1.20x per-layer latency reduction.
  • Reuses up to 87% of attention activations.
  • Negligible degradation in output quality.
  • Paper available on arXiv: 2605.08134.
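The reuse idea in the facts above can be illustrated with a minimal sketch. This is not the paper's algorithm: the cosine-similarity criterion, the threshold `tau`, and the assumption that reuse happens between consecutive denoising steps are all simplifications introduced here for illustration; the actual DARE-KV/DARE-O criteria may differ.

```python
import numpy as np

def attention(q, k, v):
    """Plain single-head scaled dot-product attention."""
    d = q.shape[-1]
    scores = q @ k.T / np.sqrt(d)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

def reuse_mask(prev_hidden, curr_hidden, tau=0.98):
    """Mark tokens whose hidden state barely changed since the previous step.

    Cosine similarity above `tau` -> reuse the cached activation.
    (The similarity criterion and threshold are assumptions, not the paper's.)
    """
    num = (prev_hidden * curr_hidden).sum(-1)
    den = (np.linalg.norm(prev_hidden, axis=-1)
           * np.linalg.norm(curr_hidden, axis=-1) + 1e-9)
    return (num / den) > tau

def step_with_reuse(curr_hidden, cache, Wq, Wk, Wv, tau=0.98):
    """One attention-layer step that reuses cached K/V for stable tokens,
    in the spirit of DARE-KV (illustrative only)."""
    q = curr_hidden @ Wq
    if cache is None:
        # First step: nothing cached yet, compute everything.
        k, v = curr_hidden @ Wk, curr_hidden @ Wv
        mask = np.zeros(len(curr_hidden), dtype=bool)
    else:
        prev_hidden, k_prev, v_prev = cache
        mask = reuse_mask(prev_hidden, curr_hidden, tau)
        k, v = k_prev.copy(), v_prev.copy()
        # Recompute K/V only for tokens that changed; reuse the rest.
        k[~mask] = curr_hidden[~mask] @ Wk
        v[~mask] = curr_hidden[~mask] @ Wv
    out = attention(q, k, v)
    return out, (curr_hidden, k, v), mask
```

In this toy version, the fraction of `True` entries in `mask` plays the role of the reported 87% activation-reuse rate: when most tokens are stable between steps, most key-value projections are skipped. DARE-O would analogously cache and reuse `out` itself for stable tokens.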

Entities

Institutions

  • arXiv

Sources