HeadQ: KV-Cache Quantization via Model-Visible Distortion Correction
HeadQ improves KV-cache quantization by measuring quantization error in the coordinates the model actually sees, rather than optimizing storage-space reconstruction. For keys, it stores a low-rank residual side code in a calibration-learned query basis and applies it as an additive correction to the attention logits; for values, it uses an A²-weighted token-distortion surrogate. Across six models, Fisher/score-space error predicts attention KL divergence better than raw key MSE. The mechanism is validated with same-budget counterexamples, null-space interventions, query-PCA controls, and incorrect-sign HeadQ tests, and confirmed end to end with dense KV-cache decoding on WikiText-103.
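The key-side mechanism can be sketched in a few lines. The snippet below is a minimal illustration, not the paper's implementation: it assumes a PCA-style calibration basis and a toy uniform quantizer, and names such as `fake_quantize`, `headq_key_logits`, and the rank of `query_basis` are hypothetical. The point it demonstrates is the one the summary states: the side code stores only the component of the key residual that lies in the learned query subspace, and that component is re-applied as an additive logit term.

```python
import numpy as np

def fake_quantize(x, bits=4):
    """Toy per-row uniform quantizer, a stand-in for a real KV-cache codec."""
    lo = x.min(axis=-1, keepdims=True)
    hi = x.max(axis=-1, keepdims=True)
    scale = np.maximum((hi - lo) / (2 ** bits - 1), 1e-12)
    return np.round((x - lo) / scale) * scale + lo

def headq_key_logits(Q, K, query_basis, bits=4):
    """Quantize keys, keep a low-rank residual side code in a
    calibration-learned query basis, and apply it as an additive
    logit correction (only the query-visible part of the key error)."""
    d = K.shape[-1]
    K_q = fake_quantize(K, bits)
    side_code = (K - K_q) @ query_basis            # (n_keys, r) residual coords
    base = Q @ K_q.T / np.sqrt(d)                  # logits from quantized keys
    correction = (Q @ query_basis) @ side_code.T / np.sqrt(d)
    return base + correction

# Calibration: learn a rank-8 query basis from held-out queries (assumed PCA).
rng = np.random.default_rng(0)
Q_cal = rng.normal(size=(1024, 64))
eigvecs, _, _ = np.linalg.svd(Q_cal.T @ Q_cal)
basis = eigvecs[:, :8]                             # top-8 query directions

logits = headq_key_logits(rng.normal(size=(4, 64)),
                          rng.normal(size=(256, 64)), basis)
```

Algebraically, the correction equals Q B Bᵀ (K − K_q)ᵀ / √d, i.e. the logit error projected onto the span of the calibration basis B, which is why a small rank can recover most of the model-visible distortion.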
Key facts
- HeadQ is a key-side method for KV-cache quantization.
- It uses a low-rank residual side code in a calibration-learned query basis.
- The method applies additive logit correction for keys.
- For values, it employs an A²-weighted token-distortion surrogate (sketched in code after this list).
- Experiments were conducted across six models.
- Fisher/score-space error predicts attention KL better than raw key MSE (see the score-space sketch after this list).
- Validation includes same-budget counterexamples and null-space interventions.
- Matched Pythia checkpoints identify a route-flip boundary anomaly.
- Dense decode experiments were performed on WikiText-103.
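To make the value-side surrogate concrete, here is a minimal sketch under one plausible reading of "A²-weighted": the attention output error from value quantization is A·ΔV, so if cross-token terms are dropped, each token's distortion enters with the squared attention mass it receives. The function name and the exact weighting are assumptions, not the paper's API.

```python
import numpy as np

def value_distortion_surrogate(A, V, V_q):
    """A^2-weighted token-distortion surrogate for value quantization.

    The attention output error is O - O_q = A @ (V - V_q); treating
    tokens independently, token j contributes roughly
    (sum_i A[i, j]**2) * ||V[j] - V_q[j]||^2 to the squared output error.
    """
    per_token_err = np.sum((V - V_q) ** 2, axis=-1)   # ||dV_j||^2 per token
    attn_mass = np.sum(A ** 2, axis=0)                # sum_i A_ij^2 per token
    return float(np.sum(attn_mass * per_token_err))
```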
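The score-space claim can likewise be made concrete: compare raw key MSE against error measured in logit coordinates, Fisher-weighted so that it matches the second-order expansion of the attention KL. The sketch below illustrates the general idea, assuming standard softmax attention; it is not the paper's exact estimator.

```python
import numpy as np

def softmax(x, axis=-1):
    z = x - x.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def raw_key_mse(K, K_q):
    """Storage-space error: ignores how queries read the keys."""
    return float(np.mean((K - K_q) ** 2))

def fisher_score_error(Q, K, K_q):
    """Model-visible error: the logit perturbation dS = Q (K - K_q)^T / sqrt(d),
    weighted by the softmax Fisher, matches KL(P || P_q) to second order:
    KL ~= 0.5 * (E_p[dS^2] - E_p[dS]^2) per query."""
    d = K.shape[-1]
    P = softmax(Q @ K.T / np.sqrt(d))
    dS = Q @ (K - K_q).T / np.sqrt(d)
    quad = np.sum(P * dS ** 2, axis=-1) - np.sum(P * dS, axis=-1) ** 2
    return float(0.5 * np.mean(quad))

def attention_kl(Q, K, K_q):
    """Exact attention KL between full- and quantized-key softmax."""
    d = K.shape[-1]
    P = softmax(Q @ K.T / np.sqrt(d))
    P_q = softmax(Q @ K_q.T / np.sqrt(d))
    return float(np.mean(np.sum(P * (np.log(P) - np.log(P_q)), axis=-1)))
```

Because `fisher_score_error` is a second-order expansion of `attention_kl`, it tracks the KL closely for small perturbations, whereas `raw_key_mse` also counts key error in directions no query ever probes.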