Hurwitz Quaternion Multiplicative Quantization for KV Cache Compression
Hurwitz Quaternion Multiplicative Quantization (HQMQ) is a new method for compressing the KV cache of large language models. It treats every four-element chunk of K or V as a quaternion and quantizes its direction with the product of two codebooks: the 24-element Hurwitz group and a layer-specific random quaternion codebook with S entries. This yields 24S effective codewords while needing only S stored parameters. To handle the outliers common in modern architectures, it adds a per-batch median-multiplier extraction step that requires no calibration. The technique was evaluated on five recent open models, including Mistral-7B, Llama-3-8B, and Qwen.
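The product-codebook idea can be illustrated with a minimal sketch in plain Python. It is not the authors' implementation. The 24 Hurwitz units and the Hamilton product are standard; everything else is an assumption made for illustration: the function names, the left-multiplication order (Hurwitz unit times secondary entry), nearest-codeword search by largest dot product, and keeping the chunk norm at full precision.

```python
import itertools
import math
import random

def qmul(a, b):
    """Hamilton product of two quaternions given as (w, x, y, z) tuples."""
    aw, ax, ay, az = a
    bw, bx, by, bz = b
    return (
        aw * bw - ax * bx - ay * by - az * bz,
        aw * bx + ax * bw + ay * bz - az * by,
        aw * by - ax * bz + ay * bw + az * bx,
        aw * bz + ax * by - ay * bx + az * bw,
    )

def hurwitz_units():
    """The 24 unit Hurwitz quaternions: 8 axis units and 16 half-integer units."""
    units = []
    for axis in range(4):
        for sign in (1.0, -1.0):
            u = [0.0] * 4
            u[axis] = sign
            units.append(tuple(u))
    units.extend(itertools.product((0.5, -0.5), repeat=4))
    return units

def random_unit_quaternions(s, seed):
    """Hypothetical secondary codebook: s random unit quaternions from a seed."""
    rng = random.Random(seed)
    out = []
    for _ in range(s):
        v = [rng.gauss(0.0, 1.0) for _ in range(4)]
        n = math.sqrt(sum(c * c for c in v))
        out.append(tuple(c / n for c in v))
    return out

def quantize_direction(chunk, hurwitz, secondary):
    """Return (hurwitz index, secondary index, norm) for a 4-element chunk.

    ASSUMPTION: the codeword is hurwitz[h] * secondary[s], chosen to maximize
    the dot product with the chunk; the norm is stored unquantized.
    """
    norm = math.sqrt(sum(c * c for c in chunk))
    best_score, best_h, best_s = -math.inf, 0, 0
    for hi, h in enumerate(hurwitz):
        for si, q in enumerate(secondary):
            codeword = qmul(h, q)
            score = sum(a * b for a, b in zip(codeword, chunk))
            if score > best_score:
                best_score, best_h, best_s = score, hi, si
    return best_h, best_s, norm

def dequantize(hi, si, norm, hurwitz, secondary):
    """Rebuild the chunk as norm times the selected unit codeword."""
    return tuple(norm * c for c in qmul(hurwitz[hi], secondary[si]))
```

With S = 8 secondary entries, the search above covers 24 × 8 = 192 directions even though only the 8 random quaternions would need to be stored per layer; the Hurwitz units are fixed constants.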
Key facts
- HQMQ is a calibration-free method for KV cache compression.
- It treats each 4-element chunk of K or V as a quaternion.
- Quantization uses the product of the 24-element Hurwitz group and a secondary random quaternion codebook.
- Effective codewords: 24S with S stored parameters.
- The median-multiplier outlier extraction step uses C=3 and requires no calibration.
- Evaluated on Mistral-7B, Llama-3-8B, Qwen, and two other open models.
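- Random initialization of the secondary codebook suffices due to the S^3 isometry (S^3 here is the 3-sphere of unit quaternions, not the codebook size S).
- Codebooks drawn from different random seeds vary in end-task perplexity by less than 1.5%.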
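The outlier extraction step listed above is not specified in detail here, so the following is only a sketch of one plausible reading. It assumes that a value counts as an outlier when its magnitude exceeds C times the median magnitude of its batch, and that outliers are set aside in a sparse map while the remaining values go on to quantization. The function names and the sparse-map layout are hypothetical; only C=3 and the absence of calibration come from the summary.

```python
import statistics

def extract_outliers(values, c=3.0):
    """Split a batch into a dense list and a sparse {index: value} outlier map.

    ASSUMPTION: x is an outlier when |x| > c * median(|x|) over the batch.
    The threshold comes from the batch itself, so no calibration data is used.
    """
    median_abs = statistics.median([abs(v) for v in values])
    threshold = c * median_abs
    outliers = {i: v for i, v in enumerate(values) if abs(v) > threshold}
    dense = [0.0 if i in outliers else v for i, v in enumerate(values)]
    return dense, outliers

def restore(dense, outliers):
    """Put the extracted outliers back into their original positions."""
    out = list(dense)
    for i, v in outliers.items():
        out[i] = v
    return out
```

For example, in the batch `[0.1, -0.2, 0.15, 8.0, -0.12, 0.18]` the median magnitude is 0.165, so with C=3 only the value 8.0 crosses the threshold of 0.495 and is extracted.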