New Fused Metal Kernel Achieves Faster int4 KV Cache Than fp16 on Apple Silicon

ai-technology · 2026-05-09

A research paper on arXiv (2605.05699) shows that the usual speed penalty of KV-cache quantization can be inverted on Apple Silicon's unified memory: an int4 cache runs faster than fp16. The authors built a single fused Metal kernel that combines a sign-randomized FFT, per-channel λ, per-group abs-max scaling, and int4 nibble packing, and exposed it as a HuggingFace Cache subclass. On Gemma-3 1B, the kernel beats fp16 across 256–4096-token prefixes, lowering ms/tok by 3 to 8%. On Qwen2.5-1.5B, it beats fp16 at short context (up to 1K tokens), lowering ms/tok by 0.7 to 2.6%. It delivers 3× persistent memory compression while preserving quality (ΔPPL = 0.000 for Qwen short-prompt, +3.6 hook ΔPPL for Gemma). The speedup is possible because the kernel's overhead of ~25 ns/vec is smaller than the bandwidth savings from compression. The kernel also narrows Qwen's 4-bit per-token catastrophe (ΔPPL from +7975 to +638.6, a 12.5× reduction) at 182 GFLOPS / D=128. Supporting findings show that SRFT and SRHT are statistically indistinguishable for KV quality.
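The four stages named above can be illustrated with a small NumPy sketch. This is not the paper's Metal kernel: the real-valued Hartley form of the FFT, the group size of 32, the symmetric [-7, 7] range, the all-ones λ, and every function name here are assumptions made only to show how sign randomization, per-channel scaling, per-group abs-max quantization, and nibble packing fit together.

```python
import numpy as np

def srft(x, signs):
    """Flip signs, then apply an orthonormal real transform built on the FFT."""
    f = np.fft.fft(x * signs, axis=-1)
    # Hartley form (Re - Im) keeps the output real; 1/sqrt(D) makes it orthonormal.
    return (f.real - f.imag) / np.sqrt(x.shape[-1])

def inverse_srft(y, signs):
    """The scaled Hartley transform is its own inverse; then undo the sign flips."""
    f = np.fft.fft(y, axis=-1)
    return (f.real - f.imag) / np.sqrt(y.shape[-1]) * signs

def quantize_int4(y, group=32):
    """Per-group abs-max quantization to int4, two values packed per byte."""
    g = y.reshape(-1, group)
    scale = np.abs(g).max(axis=1, keepdims=True) / 7.0
    scale = np.where(scale == 0.0, 1.0, scale)        # avoid division by zero
    q = np.clip(np.rint(g / scale), -7, 7).astype(np.int8)
    u = (q + 8).astype(np.uint8)                      # shift to unsigned nibbles 1..15
    packed = u[:, 0::2] | (u[:, 1::2] << 4)           # low nibble, high nibble
    return packed, scale

def dequantize_int4(packed, scale):
    """Unpack nibbles and rescale each group."""
    lo = (packed & 0x0F).astype(np.int8) - 8
    hi = (packed >> 4).astype(np.int8) - 8
    q = np.empty((packed.shape[0], packed.shape[1] * 2), dtype=np.int8)
    q[:, 0::2] = lo
    q[:, 1::2] = hi
    return q * scale

def roundtrip(x, signs, lam, group=32):
    """Transform, scale per channel (hypothetical lam), quantize, and reconstruct."""
    y = srft(x, signs) * lam
    packed, scale = quantize_int4(y, group)
    y_hat = dequantize_int4(packed, scale).reshape(x.shape)
    x_hat = inverse_srft(y_hat / lam, signs)
    return x_hat, packed, scale

rng = np.random.default_rng(0)
D = 128                                   # head dimension quoted in the brief
x = rng.standard_normal((4, D))           # four stand-in key/value vectors
signs = rng.choice([-1.0, 1.0], size=D)   # fixed random sign pattern
lam = np.ones(D)                          # placeholder per-channel scale
x_hat, packed, scale = roundtrip(x, signs, lam)
rel_err = np.linalg.norm(x_hat - x) / np.linalg.norm(x)
print("packed bytes:", packed.nbytes, "fp16 bytes:", x.astype(np.float16).nbytes)
print("relative reconstruction error:", rel_err)
```

A fused kernel would do all of this in one pass per vector instead of materializing the intermediates, which is where the per-vector overhead figure comes from.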

Key facts

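On Apple Silicon's unified memory, the usual speed cost of KV-cache quantization is inverted: int4 runs faster than fp16
Single fused Metal kernel runs faster than fp16 on Gemma-3 1B (256–4096-token prefixes) and on Qwen2.5-1.5B at short context (up to 1K tokens)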
Kernel includes sign-randomized FFT, per-channel λ, per-group abs-max, int4 nibble pack
Exposed as HuggingFace Cache subclass
3× persistent memory compression with quality preserved
ΔPPL = 0.000 for Qwen short-prompt, +3.6 hook ΔPPL for Gemma
Closes Qwen's 4-bit per-token catastrophe: ΔPPL from +7975 to +638.6
SRFT and SRHT are statistically indistinguishable for KV quality
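
As a rough check on the memory and quality figures, the arithmetic below estimates how per-group scale metadata pulls an int4 cache below the ideal 4× ratio against fp16, and recomputes the ΔPPL reduction from the two quoted values. The group sizes and the fp16 scale per group are assumptions for illustration; the brief does not say which metadata is counted in the reported 3×.

```python
def bits_per_element(group, scale_bits=16, payload_bits=4):
    """Average stored bits per value: int4 payload plus one scale shared per group."""
    return payload_bits + scale_bits / group

def compression_ratio(group, scale_bits=16, payload_bits=4, baseline_bits=16):
    """Footprint of an fp16 cache divided by the quantized footprint."""
    return baseline_bits / bits_per_element(group, scale_bits, payload_bits)

# Hypothetical group sizes; smaller groups spend more bits on scales.
for group in (8, 16, 32, 64, 128):
    print(f"group={group:3d}  ratio vs fp16 = {compression_ratio(group):.2f}x")

# Ratio of the two Qwen ΔPPL figures quoted above.
ppl_reduction = 7975 / 638.6
print(f"ΔPPL reduction = {ppl_reduction:.1f}x")
```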