DARE: Boosting Diffusion LLM Inference via Activation Reuse
Researchers have introduced DARE (Diffusion Language Model Activation Reuse), a method to accelerate inference in diffusion large language models (dLLMs) by exploiting token-wise redundancy in bi-directional self-attention. The approach comprises two mechanisms: DARE-KV reuses cached key-value activations, while DARE-O reuses attention output activations, cutting redundant computation without significant quality loss. Experiments show up to a 1.20x per-layer latency reduction while reusing up to 87% of attention activations. The work targets a key bottleneck of open-source dLLMs, which remain less mature than auto-regressive models, and offers a path to faster parallel generation. The paper is available on arXiv under identifier 2605.08134.
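To make the idea concrete, below is a minimal PyTorch sketch of DARE-KV-style reuse, not the paper's implementation: it assumes a single attention head, hypothetical function and weight names, and a simple per-token change test (the summary does not say how the paper decides which activations are reusable). Key/value projections are recomputed only for tokens whose hidden states changed since the previous denoising step; cached activations are reused for the rest.

```python
import torch
import torch.nn.functional as F

def attention_with_kv_reuse(x, x_prev, w_q, w_k, w_v, kv_cache, tol=1e-6):
    """Single-head bi-directional self-attention for one denoising step.

    x        : (seq, d_model) hidden states at the current step
    x_prev   : (seq, d_model) hidden states at the previous step, or None
    kv_cache : dict carrying 'k' and 'v' from the previous step (mutated in place)
    """
    q = x @ w_q  # queries are always recomputed

    if x_prev is None or "k" not in kv_cache:
        # First step: nothing to reuse, project every token.
        k, v = x @ w_k, x @ w_v
    else:
        # Tokens whose hidden states are (numerically) unchanged keep their
        # cached key/value activations; only changed tokens are re-projected.
        changed = (x - x_prev).abs().amax(dim=-1) > tol   # (seq,) bool mask
        k, v = kv_cache["k"].clone(), kv_cache["v"].clone()
        k[changed] = x[changed] @ w_k
        v[changed] = x[changed] @ w_v

    kv_cache["k"], kv_cache["v"] = k, v
    scores = (q @ k.T) / (q.shape[-1] ** 0.5)  # full bi-directional attention, no causal mask
    return F.softmax(scores, dim=-1) @ v
```

In this sketch queries are still recomputed for every token, since each token must attend to the updated sequence; only the key/value projections benefit from the cache.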
Key facts
- DARE targets diffusion large language models (dLLMs).
- It exploits token-wise redundancy in bi-directional self-attention.
- Two mechanisms: DARE-KV and DARE-O.
- DARE-KV reuses cached key-value activations.
- DARE-O reuses attention output activations (see the sketch after this list).
- Achieves up to 1.20x per-layer latency reduction.
- Reuses up to 87% of attention activations.
- Negligible degradation in output quality.
- Paper available on arXiv: 2605.08134.
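A companion sketch of DARE-O-style output reuse, under the same assumptions (the change criterion, names, and shapes are hypothetical): tokens judged unchanged keep their attention output from the previous step, so their rows of the attention map are never computed.

```python
import torch
import torch.nn.functional as F

def attention_with_output_reuse(x, x_prev, w_q, w_k, w_v, out_cache, tol=1e-6):
    """Reuse whole attention-output rows for tokens unchanged since the last step."""
    k, v = x @ w_k, x @ w_v                       # keys/values for the full sequence
    out = torch.empty(x.shape[0], v.shape[-1], dtype=x.dtype)

    if x_prev is None or "o" not in out_cache:
        changed = torch.ones(x.shape[0], dtype=torch.bool)   # first step: compute everything
    else:
        changed = (x - x_prev).abs().amax(dim=-1) > tol
        out[~changed] = out_cache["o"][~changed]  # reuse cached output activations

    if changed.any():
        q = x[changed] @ w_q                      # only changed rows hit the attention map
        scores = (q @ k.T) / (q.shape[-1] ** 0.5)
        out[changed] = F.softmax(scores, dim=-1) @ v

    out_cache["o"] = out
    return out
```

In practice the two ideas would presumably be combined, with KV reuse trimming the projection cost and output reuse skipping attention rows outright; the summary does not describe how the paper integrates them.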
Entities
Institutions
- arXiv