ARTFEED — Contemporary Art Intelligence

NPU Not Always Faster for Mobile LLM Inference: Stage-Level Analysis

ai-technology · 2026-05-28

A recent study published on arXiv (2605.27435) introduces the first multi-level, stage-aware benchmarking study of mobile LLM inference on a heterogeneous SoC that combines a CPU and an NPU. Using an OPMASK-based controlled pipeline decomposition, the researchers separate the communication, quantization, and computation overheads within the NPU execution path. The results run against the usual expectation: the CPU outperforms the NPU in the compute-heavy Prefill stage by as much as 1.6x, while the NPU delivers only marginal acceleration (1.05-1.2x) in the memory-intensive Decode stage. Scheduling overhead and cross-backend fallback further erode the practical benefit of NPU offloading, and increasing the amount of work offloaded to the NPU also raises energy consumption.
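The idea of separating communication, quantization, and computation overheads can be illustrated with a small differencing sketch. This is not the paper's OPMASK implementation, whose details are not given here. It assumes three hypothetical controlled runs of the same stage (transfer only, transfer plus quantization, and the full NPU path) and attributes the latency difference between consecutive runs to the component that was added. The class, field names, and timings are all made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class StageTimings:
    """Hypothetical latencies (ms) of one stage under three controlled runs."""
    transfer_only_ms: float   # only CPU<->NPU data movement enabled
    transfer_quant_ms: float  # data movement plus quantize/dequantize
    full_ms: float            # complete NPU path, including compute


def decompose(t: StageTimings) -> dict[str, float]:
    """Attribute latency to each component by differencing consecutive runs.

    Assumes the components are additive and do not overlap in time,
    which real hardware pipelines may violate.
    """
    return {
        "communication": t.transfer_only_ms,
        "quantization": t.transfer_quant_ms - t.transfer_only_ms,
        "computation": t.full_ms - t.transfer_quant_ms,
    }


def shares(parts: dict[str, float]) -> dict[str, float]:
    """Express each component as a fraction of the total latency."""
    total = sum(parts.values())
    return {name: value / total for name, value in parts.items()}


# Illustrative numbers only, not measurements from the study.
example = decompose(StageTimings(transfer_only_ms=2.0,
                                 transfer_quant_ms=5.0,
                                 full_ms=12.0))
```

A breakdown like this makes it visible when non-compute overhead, rather than the NPU's arithmetic throughput, dominates a stage.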

Key facts

  • First stage-aware, multi-level benchmarking study of mobile LLM inference on a CPU-NPU heterogeneous SoC
  • OPMASK-based controlled pipeline decomposition methodology introduced
  • CPUs outperform NPUs in Prefill stage by up to 1.6x
  • NPUs provide only 1.05-1.2x acceleration in Decode stage
  • Scheduling overhead and cross-backend fallback reduce NPU offloading benefits
  • Increasing NPU offloading leads to higher energy consumption
  • Study published on arXiv with ID 2605.27435
  • No prior study had systematically characterized NPU effectiveness at the operator and pipeline levels
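To see how the two stage-level ratios combine into an end-to-end effect, the arithmetic can be sketched as follows. This is a back-of-the-envelope model, not a result from the study. It assumes "the CPU outperforms the NPU by 1.6x" means the NPU Prefill latency is 1.6 times the CPU latency, that Prefill and Decode latencies simply add, and that the CPU baseline latencies (hypothetical parameters here) are known.

```python
def end_to_end_speedup(cpu_prefill_ms: float,
                       cpu_decode_ms: float,
                       prefill_cpu_advantage: float,
                       decode_npu_speedup: float) -> float:
    """Return NPU-path speedup over the CPU for one full request.

    prefill_cpu_advantage: how many times faster the CPU is in Prefill
        (assumed to scale the NPU Prefill latency up by this factor).
    decode_npu_speedup: how many times faster the NPU is in Decode.
    A return value below 1.0 means the NPU path is slower overall.
    """
    npu_prefill_ms = cpu_prefill_ms * prefill_cpu_advantage
    npu_decode_ms = cpu_decode_ms / decode_npu_speedup
    cpu_total = cpu_prefill_ms + cpu_decode_ms
    npu_total = npu_prefill_ms + npu_decode_ms
    return cpu_total / npu_total


# Hypothetical baseline: equal CPU time spent in Prefill and Decode,
# combined with the most extreme ratios quoted in the key facts.
ratio = end_to_end_speedup(cpu_prefill_ms=100.0, cpu_decode_ms=100.0,
                           prefill_cpu_advantage=1.6,
                           decode_npu_speedup=1.2)
print(f"end-to-end NPU speedup: {ratio:.2f}x")
```

Under these assumptions, a Prefill slowdown of this size outweighs the best-case Decode gain whenever Prefill accounts for a substantial share of the request, which is consistent with the article's point that the NPU is not always faster.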

Entities

Institutions

  • arXiv

Sources