First NPU Accelerator Designed for Diffusion-Based LLMs
A new NPU accelerator designed specifically for diffusion-based large language models (dLLMs) has been announced. Unlike traditional autoregressive LLMs, dLLMs use bidirectional attention, refresh their block-wise KV cache while reusing it across denoising steps, and run a sampling phase that is not GEMM-centric. Existing NPUs, which are optimized for autoregressive inference, therefore handle dLLMs poorly. The accelerator pairs a dLLM-specific instruction set architecture (ISA) and compiler with dedicated hardware for the sampling stage, which is reduction-heavy and top-k-driven, and it adapts KV quantization at each step to track step-wise distribution shifts. Together, these features match the distinct computational profile of dLLMs, enabling efficient inference on specialized hardware.
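To see why the sampling phase is reduction-heavy and top-k-driven rather than GEMM-centric, consider how a diffusion LLM typically unmasks tokens: each step scores every still-masked position by confidence (a row-wise reduction over the vocabulary), then commits the top-k most confident positions. This is a minimal NumPy sketch of that pattern; the function name `topk_unmask` and the exact confidence measure are illustrative assumptions, not the accelerator's actual scheme.

```python
import numpy as np

def topk_unmask(logits, mask, k):
    """Commit the k most confident masked positions this denoising step.

    The workload is dominated by reductions (softmax max, global top-k),
    not matrix multiplies -- illustrative sketch, not the real hardware path.

    logits: (seq_len, vocab) per-position logits
    mask:   boolean (seq_len,), True where the token is still masked
    k:      number of positions to unmask this step
    """
    # Per-position confidence: max softmax probability (row-wise reductions).
    probs = np.exp(logits - logits.max(axis=-1, keepdims=True))
    probs /= probs.sum(axis=-1, keepdims=True)
    confidence = probs.max(axis=-1)
    confidence[~mask] = -np.inf          # only masked positions compete

    # Global top-k over sequence positions (another reduction primitive).
    chosen = np.argpartition(-confidence, k - 1)[:k]

    # Greedy token choice for each committed position.
    tokens = logits[chosen].argmax(axis=-1)
    return chosen, tokens
```

On an NPU built around systolic GEMM arrays, these reduction and top-k primitives have no natural home, which is why the accelerator dedicates hardware to them.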
Key facts
- First NPU accelerator designed specifically for diffusion-based LLMs (dLLMs).
- dLLMs use bidirectional attention, block-wise KV cache refreshing, cross-step reuse, and non-GEMM-centric sampling.
- Existing NPUs, optimized for autoregressive inference patterns, cannot run dLLMs efficiently.
- New accelerator features a dLLM-oriented ISA and compiler.
- Hardware supports reduction-heavy, top-k-driven sampling stage.
- Addresses step-wise distribution shifts in KV quantization.
- Published on arXiv with ID 2601.20706.
- Replaces a previous version of the submission (cross-referenced).
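The step-wise distribution shift mentioned above can be made concrete: because a dLLM revisits its cached KV blocks at every denoising step and their value distribution drifts between steps, a quantization scale fixed once becomes stale. A minimal sketch of per-step re-quantization follows, assuming a symmetric int8 scheme with a per-block scale; `refresh_kv_block` is a hypothetical helper, not the accelerator's actual format.

```python
import numpy as np

def refresh_kv_block(kv_block):
    """Re-quantize one KV-cache block with a freshly computed scale.

    Recomputing the scale each denoising step (rather than fixing it once)
    tracks the step-wise drift of the KV value distribution.
    Illustrative symmetric int8 scheme -- an assumption, not the real design.
    """
    scale = np.abs(kv_block).max() / 127.0 if kv_block.size else 1.0
    q = np.clip(np.round(kv_block / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    """Recover an approximate float block from its int8 form."""
    return q.astype(np.float32) * scale
```

Here the maximum reconstruction error stays within half a quantization step of the current scale, whereas a scale computed at an earlier step could clip or waste range once the distribution shifts.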
Entities
Institutions
- arXiv