NPU Not Always Faster for Mobile LLM Inference: Stage-Level Analysis

ai-technology · 2026-05-28

A recent study on arXiv (2605.27435) introduces the first multi-level, stage-aware benchmark of mobile LLM inference on a heterogeneous SoC combining a CPU and an NPU. Using an OPMASK-based controlled pipeline decomposition, the researchers separate the communication, quantization, and computation overheads within the NPU execution path. The findings run against the usual expectation: the CPU outperforms the NPU in the compute-heavy Prefill stage by as much as 1.6x, whereas the NPU achieves only marginal acceleration (1.05-1.2x) in the memory-intensive Decode stage. Scheduling overhead and cross-backend fallback further erode the practical benefit of NPU offloading, and increasing the share of work offloaded to the NPU raises energy consumption.
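To see how the two stage-level figures combine, here is a minimal sketch of the end-to-end arithmetic. It is not taken from the paper: the function name and the CPU stage times in the usage lines are hypothetical, and only the 1.6x and 1.05-1.2x factors come from the reported results. It also ignores the scheduling and fallback overheads, so it is optimistic toward the NPU.

```python
def end_to_end_speedup(prefill_s, decode_s, prefill_speedup, decode_speedup):
    """Overall NPU-vs-CPU speedup when each stage scales by its own factor.

    prefill_s, decode_s: CPU-only stage latencies in seconds (hypothetical inputs).
    prefill_speedup, decode_speedup: NPU speedup over CPU per stage
    (> 1 means the NPU is faster, < 1 means it is slower).
    """
    if prefill_s < 0 or decode_s < 0 or prefill_s + decode_s <= 0:
        raise ValueError("stage latencies must be non-negative and not both zero")
    if prefill_speedup <= 0 or decode_speedup <= 0:
        raise ValueError("speedup factors must be positive")
    cpu_total = prefill_s + decode_s
    npu_total = prefill_s / prefill_speedup + decode_s / decode_speedup
    return cpu_total / npu_total


# Reported factors: CPU up to 1.6x faster in Prefill, NPU 1.05-1.2x faster in Decode.
NPU_PREFILL = 1 / 1.6   # NPU is slower than CPU in Prefill
NPU_DECODE_BEST = 1.2   # upper end of the reported Decode range

# Hypothetical CPU latencies: a decode-heavy chat turn and a prompt-heavy request.
decode_heavy = end_to_end_speedup(2.0, 8.0, NPU_PREFILL, NPU_DECODE_BEST)
prompt_heavy = end_to_end_speedup(6.0, 4.0, NPU_PREFILL, NPU_DECODE_BEST)
print(f"decode-heavy: {decode_heavy:.2f}x, prompt-heavy: {prompt_heavy:.2f}x")
```

Under these assumed latencies, even the best-case Decode factor leaves the decode-heavy case close to break-even, and the prompt-heavy case comes out slower on the NPU than on the CPU alone.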

Key facts

First stage-aware, multi-level benchmarking study of mobile LLM inference on CPU-NPU heterogeneous SoC
OPMASK-based controlled pipeline decomposition methodology introduced
CPUs outperform NPUs in Prefill stage by up to 1.6x
NPUs provide only 1.05-1.2x acceleration in Decode stage
Scheduling overhead and cross-backend fallback reduce NPU offloading benefits
Increasing NPU offloading leads to higher energy consumption
Study published on arXiv with ID 2605.27435
No prior study had systematically characterized NPU effectiveness at the operator and pipeline levels
