First NPU Accelerator Designed for Diffusion-Based LLMs
A new NPU accelerator designed specifically for diffusion-based large language models (dLLMs) has been announced. Unlike traditional autoregressive LLMs, dLLMs use bidirectional attention, refresh their block-wise KV cache while reusing it across denoising steps, and run a sampling phase that is not GEMM-centric. Existing NPUs, which are optimized for autoregressive inference, therefore handle dLLMs poorly. The accelerator pairs a dLLM-specific instruction set architecture (ISA) and compiler with dedicated hardware for the sampling stage, which is reduction-heavy and top-k-driven, and it adapts KV quantization at each step to track step-wise distribution shifts. Together, these features match the distinct computational profile of dLLMs, enabling efficient inference on specialized hardware.
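To see why the sampling phase is reduction-heavy and top-k-driven rather than GEMM-centric, consider how a diffusion LLM typically unmasks tokens: each step scores every still-masked position by confidence (a row-wise reduction over the vocabulary), then commits the top-k most confident positions. This is a minimal NumPy sketch of that pattern; the function name `topk_unmask` and the exact confidence measure are illustrative assumptions, not the accelerator's actual scheme.

```python
import numpy as np

def topk_unmask(logits, mask, k):
    """Commit the k most confident masked positions this denoising step.

    The workload is dominated by reductions (softmax max, global top-k),
    not matrix multiplies -- illustrative sketch, not the real hardware path.

    logits: (seq_len, vocab) per-position logits
    mask:   boolean (seq_len,), True where the token is still masked
    k:      number of positions to unmask this step
    """
    # Per-position confidence: max softmax probability (row-wise reductions).
    probs = np.exp(logits - logits.max(axis=-1, keepdims=True))
    probs /= probs.sum(axis=-1, keepdims=True)
    confidence = probs.max(axis=-1)
    confidence[~mask] = -np.inf          # only masked positions compete

    # Global top-k over sequence positions (another reduction primitive).
    chosen = np.argpartition(-confidence, k - 1)[:k]

    # Greedy token choice for each committed position.
    tokens = logits[chosen].argmax(axis=-1)
    return chosen, tokens
```

On an NPU built around systolic GEMM arrays, these reduction and top-k primitives have no natural home, which is why the accelerator dedicates hardware to them.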
Key facts
- First NPU accelerator designed specifically for diffusion-based LLMs (dLLMs).
- dLLMs use bidirectional attention, block-wise KV cache refreshing, cross-step reuse, and non-GEMM-centric sampling.
- Existing NPUs, optimized for autoregressive inference patterns, cannot run dLLMs efficiently.
- New accelerator features a dLLM-oriented ISA and compiler.
- Hardware supports reduction-heavy, top-k-driven sampling stage.
- Addresses step-wise distribution shifts in KV quantization.
- Published on arXiv with ID 2601.20706.
- Replaces a previous version of the submission (cross-referenced).
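The step-wise distribution shift mentioned above can be made concrete: because a dLLM revisits its cached KV blocks at every denoising step and their value distribution drifts between steps, a quantization scale fixed once becomes stale. A minimal sketch of per-step re-quantization follows, assuming a symmetric int8 scheme with a per-block scale; `refresh_kv_block` is a hypothetical helper, not the accelerator's actual format.

```python
import numpy as np

def refresh_kv_block(kv_block):
    """Re-quantize one KV-cache block with a freshly computed scale.

    Recomputing the scale each denoising step (rather than fixing it once)
    tracks the step-wise drift of the KV value distribution.
    Illustrative symmetric int8 scheme -- an assumption, not the real design.
    """
    scale = np.abs(kv_block).max() / 127.0 if kv_block.size else 1.0
    q = np.clip(np.round(kv_block / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    """Recover an approximate float block from its int8 form."""
    return q.astype(np.float32) * scale
```

Here the maximum reconstruction error stays within half a quantization step of the current scale, whereas a scale computed at an earlier step could clip or waste range once the distribution shifts.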
Entities
Institutions
- arXiv