NPU Not Always Faster for Mobile LLM Inference: Stage-Level Analysis
A recent arXiv preprint (2605.27435) presents what its authors describe as the first multi-level, stage-aware benchmark of mobile LLM inference on a heterogeneous SoC that combines a CPU and an NPU. Using an OPMASK-based controlled pipeline decomposition, the researchers separate the communication, quantization, and computation overheads within the NPU execution path. The results show a counterintuitive inversion: the CPU outperforms the NPU in the compute-heavy Prefill stage by as much as 1.6x, whereas the NPU achieves only marginal acceleration (1.05-1.2x) in the memory-intensive Decode stage. Scheduling overhead and cross-backend fallback further erode the practical benefit of NPU offloading, and increasing the amount of work offloaded to the NPU raises energy consumption.
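To make the speedup figures concrete, the sketch below shows how a per-stage ratio such as "1.6x" or "1.2x" can be derived from measured latencies. It is a minimal illustration, not the study's tooling: the `StageLatency` class, the `describe` helper, and the latency values are all hypothetical, and the numbers are placeholders chosen only to reproduce ratios of the reported magnitude.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StageLatency:
    """Latency of one inference stage on each backend, in milliseconds."""
    stage: str
    cpu_ms: float
    npu_ms: float

    def npu_speedup(self) -> float:
        """NPU speedup over the CPU; a value below 1.0 means the CPU is faster."""
        return self.cpu_ms / self.npu_ms


def describe(lat: StageLatency) -> str:
    """Report which backend wins the stage and by what factor."""
    s = lat.npu_speedup()
    if s >= 1.0:
        return f"{lat.stage}: NPU is {s:.2f}x faster than CPU"
    return f"{lat.stage}: CPU is {1.0 / s:.2f}x faster than NPU"


# Hypothetical placeholder latencies, NOT measurements from the paper.
measurements = [
    StageLatency("prefill", cpu_ms=100.0, npu_ms=160.0),
    StageLatency("decode", cpu_ms=60.0, npu_ms=50.0),
]

for m in measurements:
    print(describe(m))
```

Reporting the ratio separately for each stage is what exposes the inversion: a single end-to-end number would blend a Prefill slowdown with a Decode speedup and hide both.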
Key facts
- First stage-aware, multi-level benchmarking study of mobile LLM inference on a CPU-NPU heterogeneous SoC
- OPMASK-based controlled pipeline decomposition methodology introduced
- CPUs outperform NPUs in Prefill stage by up to 1.6x
- NPUs provide only 1.05-1.2x acceleration in Decode stage
- Scheduling overhead and cross-backend fallback reduce NPU offloading benefits
- Increasing NPU offloading leads to higher energy consumption
- Study published on arXiv with ID 2605.27435
- According to the authors, no prior study had systematically characterized NPU effectiveness at both the operator and pipeline levels
Entities
Institutions
- arXiv