HeadQ: KV-Cache Quantization via Model-Visible Distortion Correction
HeadQ improves KV-cache quantization by measuring quantization error in the coordinates the model actually sees, rather than optimizing storage-space reconstruction. For keys, it stores a low-rank residual side code in a calibration-learned query basis and applies it as an additive correction to the attention logits; for values, it uses an A²-weighted token-distortion surrogate. Across six models, Fisher/score-space error predicts attention KL divergence better than raw key MSE. The mechanism is validated with same-budget counterexamples, null-space interventions, query-PCA controls, and incorrect-sign HeadQ tests, and confirmed end to end with dense KV-cache decoding on WikiText-103.
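The key-side mechanism can be sketched in a few lines. The snippet below is a minimal illustration, not the paper's implementation: it assumes a PCA-style calibration basis and a toy uniform quantizer, and names such as `fake_quantize`, `headq_key_logits`, and the rank of `query_basis` are hypothetical. The point it demonstrates is the one the summary states: the side code stores only the component of the key residual that lies in the learned query subspace, and that component is re-applied as an additive logit term.

```python
import numpy as np

def fake_quantize(x, bits=4):
    """Toy per-row uniform quantizer, a stand-in for a real KV-cache codec."""
    lo = x.min(axis=-1, keepdims=True)
    hi = x.max(axis=-1, keepdims=True)
    scale = np.maximum((hi - lo) / (2 ** bits - 1), 1e-12)
    return np.round((x - lo) / scale) * scale + lo

def headq_key_logits(Q, K, query_basis, bits=4):
    """Quantize keys, keep a low-rank residual side code in a
    calibration-learned query basis, and apply it as an additive
    logit correction (only the query-visible part of the key error)."""
    d = K.shape[-1]
    K_q = fake_quantize(K, bits)
    side_code = (K - K_q) @ query_basis            # (n_keys, r) residual coords
    base = Q @ K_q.T / np.sqrt(d)                  # logits from quantized keys
    correction = (Q @ query_basis) @ side_code.T / np.sqrt(d)
    return base + correction

# Calibration: learn a rank-8 query basis from held-out queries (assumed PCA).
rng = np.random.default_rng(0)
Q_cal = rng.normal(size=(1024, 64))
eigvecs, _, _ = np.linalg.svd(Q_cal.T @ Q_cal)
basis = eigvecs[:, :8]                             # top-8 query directions

logits = headq_key_logits(rng.normal(size=(4, 64)),
                          rng.normal(size=(256, 64)), basis)
```

Algebraically, the correction equals Q B Bᵀ (K − K_q)ᵀ / √d, i.e. the logit error projected onto the span of the calibration basis B, which is why a small rank can recover most of the model-visible distortion.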
Key facts
- HeadQ is a key-side method for KV-cache quantization.
- It uses a low-rank residual side code in a calibration-learned query basis.
- The method applies additive logit correction for keys.
- For values, it employs an A²-weighted token-distortion surrogate (sketched in code after this list).
- Experiments were conducted across six models.
- Fisher/score-space error predicts attention KL better than raw key MSE (see the score-space sketch after this list).
- Validation includes same-budget counterexamples and null-space interventions.
- Matched Pythia checkpoints identify a route-flip boundary anomaly.
- Dense decode experiments were performed on WikiText-103.
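To make the value-side surrogate concrete, here is a minimal sketch under one plausible reading of "A²-weighted": the attention output error from value quantization is A·ΔV, so if cross-token terms are dropped, each token's distortion enters with the squared attention mass it receives. The function name and the exact weighting are assumptions, not the paper's API.

```python
import numpy as np

def value_distortion_surrogate(A, V, V_q):
    """A^2-weighted token-distortion surrogate for value quantization.

    The attention output error is O - O_q = A @ (V - V_q); treating
    tokens independently, token j contributes roughly
    (sum_i A[i, j]**2) * ||V[j] - V_q[j]||^2 to the squared output error.
    """
    per_token_err = np.sum((V - V_q) ** 2, axis=-1)   # ||dV_j||^2 per token
    attn_mass = np.sum(A ** 2, axis=0)                # sum_i A_ij^2 per token
    return float(np.sum(attn_mass * per_token_err))
```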
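The score-space claim can likewise be made concrete: compare raw key MSE against error measured in logit coordinates, Fisher-weighted so that it matches the second-order expansion of the attention KL. The sketch below illustrates the general idea, assuming standard softmax attention; it is not the paper's exact estimator.

```python
import numpy as np

def softmax(x, axis=-1):
    z = x - x.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def raw_key_mse(K, K_q):
    """Storage-space error: ignores how queries read the keys."""
    return float(np.mean((K - K_q) ** 2))

def fisher_score_error(Q, K, K_q):
    """Model-visible error: the logit perturbation dS = Q (K - K_q)^T / sqrt(d),
    weighted by the softmax Fisher, matches KL(P || P_q) to second order:
    KL ~= 0.5 * (E_p[dS^2] - E_p[dS]^2) per query."""
    d = K.shape[-1]
    P = softmax(Q @ K.T / np.sqrt(d))
    dS = Q @ (K - K_q).T / np.sqrt(d)
    quad = np.sum(P * dS ** 2, axis=-1) - np.sum(P * dS, axis=-1) ** 2
    return float(0.5 * np.mean(quad))

def attention_kl(Q, K, K_q):
    """Exact attention KL between full- and quantized-key softmax."""
    d = K.shape[-1]
    P = softmax(Q @ K.T / np.sqrt(d))
    P_q = softmax(Q @ K_q.T / np.sqrt(d))
    return float(np.mean(np.sum(P * (np.log(P) - np.log(P_q)), axis=-1)))
```

Because `fisher_score_error` is a second-order expansion of `attention_kl`, it tracks the KL closely for small perturbations, whereas `raw_key_mse` also counts key error in directions no query ever probes.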