ARTFEED — Contemporary Art Intelligence

HeadQ: KV-Cache Quantization via Model-Visible Distortion Correction

other · 2026-05-07

HeadQ is a new technique that improves KV-cache quantization by measuring error in the coordinates visible to the model, rather than optimizing reconstruction in storage space. For keys, it stores a low-rank residual side code in a calibration-learned query basis and applies it as an additive correction to the attention logits. For values, it uses an A²-weighted token-distortion surrogate. Across six models, Fisher/score-space error predicts attention KL divergence more accurately than raw key MSE. The method is stress-tested with same-budget counterexamples, null-space interventions, query-PCA controls, and wrong-sign HeadQ ablations; dense KV-cache decoding experiments on WikiText-103 further confirm its effectiveness.
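
To make the key-side mechanism concrete, here is a minimal sketch, assuming a PCA-derived query basis and a simple rounding quantizer; the construction and names (B, S, logits) are ours, not the paper's code. The key quantization residual is projected into a calibration-learned query basis, stored as a low-rank side code, and added back as a logit correction at attention time.

    import numpy as np

    rng = np.random.default_rng(0)
    d, r, n = 64, 8, 128             # head dim, side-code rank, cached tokens (illustrative)

    # Calibration queries concentrated in a low-dimensional subspace, so that a
    # top-r PCA basis captures most of their energy (assumed setup).
    mix = rng.standard_normal((r, d))
    Q_cal = rng.standard_normal((4096, r)) @ mix + 0.05 * rng.standard_normal((4096, d))
    _, _, Vt = np.linalg.svd(Q_cal, full_matrices=False)
    B = Vt[:r].T                     # (d, r) calibration-learned query basis

    K = rng.standard_normal((n, d))
    K_hat = np.round(K * 4) / 4      # stand-in for an actual KV-cache quantizer

    # Low-rank residual side code: key quantization error expressed in the query basis.
    S = (K - K_hat) @ B              # (n, r), stored next to the quantized keys

    def logits(q, correct=True):
        out = q @ K_hat.T
        if correct:
            out = out + (q @ B) @ S.T    # additive logit adjustment
        return out

    q = rng.standard_normal(r) @ mix     # a query drawn like the calibration set
    print(np.abs(q @ K.T - logits(q, correct=False)).max())   # uncorrected logit error
    print(np.abs(q @ K.T - logits(q)).max())                  # much smaller once corrected

The correction is exact for any query lying in the span of the learned basis, which is why a basis fit to calibration queries, rather than to the keys themselves, is the natural choice.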

Key facts

  • HeadQ is a KV-cache quantization method whose core correction acts on the key side.
  • It uses a low-rank residual side code in a calibration-learned query basis.
  • The side code is applied as an additive correction to the attention logits.
  • For values, it employs an A²-weighted token-distortion surrogate (first sketch after this list).
  • Experiments were conducted across six models.
  • Fisher/score-space error predicts attention KL better than raw key MSE (second sketch after this list).
  • Validation includes same-budget counterexamples and null-space interventions.
  • Matched Pythia checkpoints identify a route-flip boundary anomaly.
  • Dense decode experiments were performed on WikiText-103.
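
The value-side surrogate can be sketched in a few lines as well. The snippet below is a minimal sketch, assuming the surrogate weights each token's value distortion by its summed squared attention mass; the setup and names are ours, not the paper's definition. The motivation: if per-token value errors are roughly independent and zero-mean, the expected squared error of each attention output row decomposes into an A²-weighted sum of per-token distortions.

    import numpy as np

    rng = np.random.default_rng(2)
    n, d, m = 64, 32, 16                       # cached tokens, head dim, queries

    def softmax_rows(X):
        e = np.exp(X - X.max(axis=1, keepdims=True))
        return e / e.sum(axis=1, keepdims=True)

    Q = rng.standard_normal((m, d))
    K = rng.standard_normal((n, d))
    V = rng.standard_normal((n, d))
    A = softmax_rows(Q @ K.T / np.sqrt(d))     # (m, n) attention weights

    E = 0.1 * rng.standard_normal((n, d))      # per-token value quantization error
    V_hat = V + E

    # Surrogate: weight each token's value distortion by its summed squared
    # attention mass, i.e. sum_i A[i, j]**2 * ||e_j||**2.
    w = (A ** 2).sum(axis=0)                   # (n,) per-token weights
    surrogate = float(w @ (E ** 2).sum(axis=1))

    # The exact squared output error it approximates (equal in expectation
    # when per-token errors are independent and zero-mean).
    exact = float(((A @ V_hat - A @ V) ** 2).sum())
    print(surrogate, exact)                    # same ballpark in this toy setup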

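To see why a score-space metric can out-predict raw MSE, the toy experiment below (our construction; the subspace-concentrated queries and the E[q qᵀ]-weighted error are assumptions standing in for the paper's Fisher/score-space metric) holds raw key MSE fixed across trials while varying the direction of the key error, so only the query-weighted error can rank the perturbations by the attention KL they induce.

    import numpy as np

    rng = np.random.default_rng(1)
    d, n = 32, 64
    mix = rng.standard_normal((8, d))          # queries concentrated in an 8-dim subspace

    def sample_q(m):
        return rng.standard_normal((m, 8)) @ mix

    Q_cal = sample_q(4096)
    M = Q_cal.T @ Q_cal / len(Q_cal)           # E[q q^T]: score-space weighting

    def softmax(x):
        e = np.exp(x - x.max())
        return e / e.sum()

    def attn_kl(K, K_hat, queries):
        """Mean KL(p || p_hat) between exact and perturbed attention rows."""
        total = 0.0
        for q in queries:
            p  = softmax(q @ K.T / np.sqrt(d))
            ph = softmax(q @ K_hat.T / np.sqrt(d))
            total += float((p * np.log(p / ph)).sum())
        return total / len(queries)

    K, qs = rng.standard_normal((n, d)), sample_q(16)
    score_errs, kls = [], []
    for _ in range(300):
        E = rng.standard_normal((n, d))
        E *= 0.3 * np.sqrt(E.size) / np.linalg.norm(E)   # same raw key MSE every trial
        score_errs.append(np.einsum('nd,de,ne->', E, M, E) / n)
        kls.append(attn_kl(K, K + E, qs))

    print(np.corrcoef(score_errs, kls)[0, 1])  # positive: score-space error tracks KL

Raw key MSE is identical across trials by construction, so any observed correlation must come through the score-space weighting, which mirrors the paper's claim at toy scale.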