ARTFEED — Contemporary Art Intelligence

First NPU Accelerator Designed for Diffusion-Based LLMs

ai-technology · 2026-04-25

A new NPU accelerator designed specifically for diffusion-based large language models (dLLMs) has just been launched, the first of its kind. Unlike traditional autoregressive LLMs, dLLMs use bidirectional attention, refresh their block-wise KV cache with cross-step reuse, and run a sampling phase that is not GEMM-centric. Existing NPUs, which are optimized for autoregressive inference, therefore serve dLLM workloads poorly. The new accelerator ships with a dLLM-specific instruction set architecture (ISA) and compiler, includes dedicated hardware for the reduction-heavy, top-k-driven sampling phase, and adapts KV quantization at every step to track step-wise distribution shifts. Together these features allow dLLM inference to run efficiently on specialized hardware.
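To see why the sampling phase is reduction-heavy and top-k-driven rather than GEMM-centric, here is a minimal sketch of one dLLM denoising step in the confidence-based unmasking style common to diffusion decoders. The function name and shapes are illustrative assumptions, not details of the accelerator's actual ISA.

```python
import numpy as np

def denoise_step(logits, mask, k):
    """Commit the k masked positions the model is most confident about.

    logits: (seq_len, vocab) float array of per-position token scores
    mask:   (seq_len,) bool array, True where the token is still masked
    k:      number of positions to unmask this step
    """
    # Softmax over the vocab axis, then two reductions per position:
    # argmax (predicted token) and max (its confidence).
    probs = np.exp(logits - logits.max(axis=-1, keepdims=True))
    probs /= probs.sum(axis=-1, keepdims=True)
    tokens = probs.argmax(axis=-1)
    conf = probs.max(axis=-1)

    # Top-k selection over masked positions only: the kind of
    # non-GEMM work the article says current NPUs are not built for.
    conf = np.where(mask, conf, -np.inf)
    commit = np.argsort(conf)[-k:]

    new_mask = mask.copy()
    new_mask[commit] = False
    return tokens, new_mask
```

Every operation here is a reduction or a sort; none of it maps onto the matrix-multiply pipelines that autoregressive-oriented NPUs optimize for.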

Key facts

  • First NPU accelerator designed specifically for diffusion-based LLMs (dLLMs).
  • dLLMs use bidirectional attention, block-wise KV cache refreshing, cross-step reuse, and non-GEMM-centric sampling.
  • Existing NPUs, optimized for autoregressive inference, handle dLLM workloads inefficiently.
  • New accelerator features a dLLM-oriented ISA and compiler.
  • Hardware supports reduction-heavy, top-k-driven sampling stage.
  • Addresses step-wise distribution shifts in KV quantization.
  • Published on arXiv with ID 2601.20706.
  • Replaces a previous arXiv version.
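The step-wise KV quantization point above can be illustrated with a small sketch, assuming simple symmetric int8 scaling (an assumption for clarity; the accelerator's actual scheme is not described). A scale calibrated once drifts as the KV cache is refreshed each denoising step, so the scale must be recomputed per step:

```python
import numpy as np

def quantize_kv(kv, scale=None):
    """Symmetric int8 quantization; recompute the scale when None."""
    if scale is None:
        scale = float(np.abs(kv).max()) / 127.0 or 1.0
    q = np.clip(np.round(kv / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize_kv(q, scale):
    return q.astype(np.float32) * scale

# Simulated distribution shift between denoising steps: the refreshed
# KV cache at a later step has a much wider value range.
rng = np.random.default_rng(1)
kv_early = rng.normal(scale=1.0, size=1024).astype(np.float32)
kv_late = rng.normal(scale=4.0, size=1024).astype(np.float32)

_, stale_scale = quantize_kv(kv_early)           # calibrated once
q_stale, _ = quantize_kv(kv_late, stale_scale)   # reused later: clips
q_fresh, fresh_scale = quantize_kv(kv_late)      # recomputed per step

err_stale = np.abs(dequantize_kv(q_stale, stale_scale) - kv_late).mean()
err_fresh = np.abs(dequantize_kv(q_fresh, fresh_scale) - kv_late).mean()
```

With the stale scale, out-of-range values clip and reconstruction error grows; recomputing the scale each step keeps the error low, which is the shift-tracking behavior the accelerator builds into hardware.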

Entities

Institutions

  • arXiv

Sources