Quant.npu: Fully Static Quantization Framework for On-Device LLMs

ai-technology · 2026-05-22

Researchers propose Quant.npu, a fully static quantization framework that enables efficient inference of large language models (LLMs) on mobile devices equipped with Neural Processing Units (NPUs). Existing post-training quantization (PTQ) methods rely on dynamic activation quantization, which is incompatible with NPU hardware constraints. Quant.npu instead uses integer-only quantization with learnable parameters and rotation matrices, eliminating the runtime re-computation of quantization parameters. The study finds that initialization and selective optimization of the quantization parameters are critical for stability: improper initialization and naive joint optimization cause gradient instability that disrupts the optimization of the rotation matrices.
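The difference between dynamic and static activation quantization can be illustrated with a minimal sketch. It is not the Quant.npu implementation: the function names, the symmetric 8-bit scheme, and the max-based calibration are assumptions chosen for illustration only. The point it shows is that a dynamic scale must be recomputed from every input at runtime, whereas a static scale is fixed once offline and reused unchanged.

```python
def quantize(xs, scale, bits=8):
    """Symmetric integer quantization of a list of floats with a given scale."""
    qmax = 2 ** (bits - 1) - 1
    # Round to the nearest integer and clamp to the signed integer range.
    return [max(-qmax - 1, min(qmax, round(x / scale))) for x in xs]


def dynamic_scale(xs, bits=8):
    """Dynamic scheme: the scale is derived from the current input at runtime."""
    qmax = 2 ** (bits - 1) - 1
    return (max(abs(x) for x in xs) or 1.0) / qmax


def calibrate_static_scale(calibration_batches, bits=8):
    """Static scheme: the scale is fixed once, offline, from calibration data."""
    qmax = 2 ** (bits - 1) - 1
    peak = max(abs(x) for batch in calibration_batches for x in batch)
    return (peak or 1.0) / qmax


# Hypothetical calibration data, used only to fix the static scale offline.
calibration = [[0.5, -1.0, 0.25], [2.0, -0.75, 1.5]]
STATIC_SCALE = calibrate_static_scale(calibration)

# At inference time the static path reuses STATIC_SCALE as-is, while the
# dynamic path has to scan the activations to compute a fresh scale.
activations = [0.9, -0.3, 1.2]
q_static = quantize(activations, STATIC_SCALE)
q_dynamic = quantize(activations, dynamic_scale(activations))
```

In this toy setup the static path uses less of the integer range because its scale was set by a larger calibration peak. That loss of resolution is the cost a fully static scheme has to recover by choosing its quantization parameters well.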

Key facts

Quant.npu is a fully static quantization framework for mobile NPU inference of LLMs.
Existing PTQ methods use dynamic activation quantization, which is incompatible with NPU constraints.
Quant.npu employs integer-only quantization with learnable parameters and rotation matrices.
It eliminates runtime quantization parameter re-computation.
Initialization and selective optimization of quantization parameters are crucial for stability.
Improper initialization and naive joint optimization cause gradient instability.
The framework enables low-bit activation-weight quantization.
The paper is available on arXiv with ID 2605.20295.
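
The summary does not describe how Quant.npu applies its rotation matrices. The sketch below assumes the usual role of rotations in rotation-based quantization: an orthogonal matrix R is applied to the activations and its transpose is folded into the weights, so the layer output is unchanged while outlier values are spread across channels. The 2x2 matrices and the 45-degree angle are hypothetical values picked to keep the arithmetic easy to follow.

```python
import math


def matmul(a, b):
    """Plain matrix product of two lists of lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def transpose(m):
    return [list(row) for row in zip(*m)]


theta = math.pi / 4
c, s = math.cos(theta), math.sin(theta)
R = [[c, s], [-s, c]]  # orthogonal rotation, so R times its transpose is the identity

x = [[8.0, 0.0]]              # activation row with one outlier channel
W = [[1.0, 2.0], [3.0, 4.0]]  # hypothetical weight matrix

x_rot = matmul(x, R)             # rotated activations
W_rot = matmul(transpose(R), W)  # inverse rotation folded into the weights

y_ref = matmul(x, W)          # original layer output
y_rot = matmul(x_rot, W_rot)  # output computed from the rotated tensors

peak_before = max(abs(v) for v in x[0])
peak_after = max(abs(v) for v in x_rot[0])
```

Here the largest activation magnitude drops from 8.0 to about 5.66 while the layer output is preserved up to floating-point error. A smaller peak lets a fixed quantization scale cover the activations with finer resolution.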
