DARE: Boosting Diffusion LLM Inference via Activation Reuse
Researchers have introduced DARE (Diffusion Language Model Activation Reuse), a method to accelerate inference in diffusion large language models (dLLMs) by exploiting token-wise redundancy in bi-directional self-attention. The approach comprises two mechanisms: DARE-KV reuses cached key-value activations, while DARE-O reuses attention output activations, cutting redundant computation without significant quality loss. Experiments show up to a 1.20x per-layer latency reduction while reusing up to 87% of attention activations. The work targets a key bottleneck of open-source dLLMs, which remain less mature than auto-regressive models, and offers a path to faster parallel generation. The paper is available on arXiv under identifier 2605.08134.
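To make the idea concrete, below is a minimal PyTorch sketch of DARE-KV-style reuse, not the paper's implementation: it assumes a single attention head, hypothetical function and weight names, and a simple per-token change test (the summary does not say how the paper decides which activations are reusable). Key/value projections are recomputed only for tokens whose hidden states changed since the previous denoising step; cached activations are reused for the rest.

```python
import torch
import torch.nn.functional as F

def attention_with_kv_reuse(x, x_prev, w_q, w_k, w_v, kv_cache, tol=1e-6):
    """Single-head bi-directional self-attention for one denoising step.

    x        : (seq, d_model) hidden states at the current step
    x_prev   : (seq, d_model) hidden states at the previous step, or None
    kv_cache : dict carrying 'k' and 'v' from the previous step (mutated in place)
    """
    q = x @ w_q  # queries are always recomputed

    if x_prev is None or "k" not in kv_cache:
        # First step: nothing to reuse, project every token.
        k, v = x @ w_k, x @ w_v
    else:
        # Tokens whose hidden states are (numerically) unchanged keep their
        # cached key/value activations; only changed tokens are re-projected.
        changed = (x - x_prev).abs().amax(dim=-1) > tol   # (seq,) bool mask
        k, v = kv_cache["k"].clone(), kv_cache["v"].clone()
        k[changed] = x[changed] @ w_k
        v[changed] = x[changed] @ w_v

    kv_cache["k"], kv_cache["v"] = k, v
    scores = (q @ k.T) / (q.shape[-1] ** 0.5)  # full bi-directional attention, no causal mask
    return F.softmax(scores, dim=-1) @ v
```

In this sketch queries are still recomputed for every token, since each token must attend to the updated sequence; only the key/value projections benefit from the cache.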
Key facts
- DARE targets diffusion large language models (dLLMs).
- It exploits token-wise redundancy in bi-directional self-attention.
- Two mechanisms: DARE-KV and DARE-O.
- DARE-KV reuses cached key-value activations.
- DARE-O reuses attention output activations (see the sketch after this list).
- Achieves up to 1.20x per-layer latency reduction.
- Reuses up to 87% of attention activations.
- Negligible degradation in output quality.
- Paper available on arXiv: 2605.08134.
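A companion sketch of DARE-O-style output reuse, under the same assumptions (the change criterion, names, and shapes are hypothetical): tokens judged unchanged keep their attention output from the previous step, so their rows of the attention map are never computed.

```python
import torch
import torch.nn.functional as F

def attention_with_output_reuse(x, x_prev, w_q, w_k, w_v, out_cache, tol=1e-6):
    """Reuse whole attention-output rows for tokens unchanged since the last step."""
    k, v = x @ w_k, x @ w_v                       # keys/values for the full sequence
    out = torch.empty(x.shape[0], v.shape[-1], dtype=x.dtype)

    if x_prev is None or "o" not in out_cache:
        changed = torch.ones(x.shape[0], dtype=torch.bool)   # first step: compute everything
    else:
        changed = (x - x_prev).abs().amax(dim=-1) > tol
        out[~changed] = out_cache["o"][~changed]  # reuse cached output activations

    if changed.any():
        q = x[changed] @ w_q                      # only changed rows hit the attention map
        scores = (q @ k.T) / (q.shape[-1] ** 0.5)
        out[changed] = F.softmax(scores, dim=-1) @ v

    out_cache["o"] = out
    return out
```

In practice the two ideas would presumably be combined, with KV reuse trimming the projection cost and output reuse skipping attention rows outright; the summary does not describe how the paper integrates them.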
Entities
Institutions
- arXiv