FibQuant: New Vector Quantizer for KV-Cache Compression in AI
FibQuant is a technique for easing the memory bottleneck of long-context AI inference by compressing the key-value (KV) cache. The cache grows with context length, batch size, layer count, and head count, and is read at every decoding step, so memory traffic becomes a dominant constraint. Existing rotation-based scalar codecs store the norm, apply a shared random rotation, and quantize coordinates one at a time, but they discard the geometric structure that normalization creates. After a Haar rotation, a block of k consecutive coordinates follows a spherical-Beta distribution on the unit ball. FibQuant is a universal fixed-rate vector quantizer that keeps the normalize-rotate-store interface while replacing scalar tables with a shared radial-angular codebook matched to this canonical source. The codebook combines Beta-quantile radii, Fibonacci/Roberts–Kronecker quasi-uniform directions, and multi-restart Lloyd optimization. The paper is available on arXiv under ID 2605.11478.
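To make the codebook construction concrete, here is a minimal Python sketch of how Beta-quantile radii could be crossed with Roberts–Kronecker quasi-uniform directions to form a radial-angular product codebook. Everything beyond the summary above is an assumption: the Beta(k/2, (d - k)/2) law for a k-block's squared norm (the standard distribution for a k-coordinate block of a uniform unit vector in R^d), the Gaussian inverse-CDF mapping onto the sphere, and all function names are ours, not the paper's.

```python
import numpy as np
from scipy.stats import beta, norm

def beta_quantile_radii(k: int, d: int, n_shells: int) -> np.ndarray:
    # Mid-point quantiles of Beta(k/2, (d-k)/2): for a unit vector in R^d,
    # the squared norm of a k-coordinate block follows this law, so these
    # radii spread the shells evenly in probability mass (assumed model).
    probs = (np.arange(n_shells) + 0.5) / n_shells
    return np.sqrt(beta.ppf(probs, k / 2.0, (d - k) / 2.0))

def kronecker_directions(k: int, n_dirs: int) -> np.ndarray:
    # Roberts' generalized-golden-ratio Kronecker sequence in [0,1)^k,
    # pushed through the Gaussian inverse CDF and normalized: one common
    # way to get quasi-uniform points on S^{k-1} (the paper may differ).
    phi = 1.5
    for _ in range(64):  # fixed point of x = (1 + x)^(1/(k+1))
        phi = (1.0 + phi) ** (1.0 / (k + 1))
    alpha = (1.0 / phi) ** np.arange(1, k + 1)
    u = (np.outer(np.arange(1, n_dirs + 1), alpha) + 0.5) % 1.0
    g = norm.ppf(np.clip(u, 1e-9, 1 - 1e-9))  # avoid infinite tails
    return g / np.linalg.norm(g, axis=1, keepdims=True)

def product_codebook(k: int, d: int, n_shells: int, n_dirs: int) -> np.ndarray:
    # Radial-angular product: every shell radius paired with every direction.
    radii = beta_quantile_radii(k, d, n_shells)
    dirs = kronecker_directions(k, n_dirs)
    return (radii[:, None, None] * dirs[None, :, :]).reshape(-1, k)
```

A natural reading of the summary is that this product grid is only an initialization, which multi-restart Lloyd optimization then refines against the spherical-Beta source.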
Key facts
- FibQuant is a universal fixed-rate vector quantizer for KV-cache compression (see the end-to-end sketch after this list).
- It addresses memory traffic problems in long-context AI inference.
- The KV cache grows with context length, batch size, layers, and heads.
- Existing rotation-based scalar codecs discard the geometric structure introduced by normalization.
- After a Haar rotation, a block of k coordinates is a spherical-Beta source.
- FibQuant uses a shared radial-angular codebook matched to this source.
- The codebook combines Beta-quantile radii, Fibonacci/Roberts–Kronecker quasi-uniform directions, and multi-restart Lloyd optimization.
- The paper is on arXiv with ID 2605.11478.
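To show how the facts above fit together, the sketch below wires a codebook into the normalize-rotate-store interface: sample a shared Haar rotation, store each vector's norm, and map every k-block to its nearest codeword. It assumes the `product_codebook` helper from the earlier sketch is in scope; the QR-based Haar sampler, the brute-force search, and all names are illustrative, not the paper's implementation.

```python
import numpy as np

def haar_rotation(d: int, seed: int = 0) -> np.ndarray:
    # Shared random rotation: QR of a Gaussian matrix, columns sign-fixed
    # so the result is Haar-distributed on the orthogonal group.
    rng = np.random.default_rng(seed)
    q, r = np.linalg.qr(rng.standard_normal((d, d)))
    return q * np.sign(np.diag(r))

def encode(vec, rot, codebook, k):
    # Store the norm, rotate to the canonical spherical source, then map
    # each k-block to its nearest codeword (brute force, for clarity).
    scale = np.linalg.norm(vec)
    x = (rot @ vec) / scale
    blocks = x.reshape(-1, k)  # assumes k divides d
    d2 = ((blocks[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    return scale, d2.argmin(axis=1)

def decode(scale, idx, rot, codebook):
    # Look up codewords, undo the rotation, and restore the stored norm.
    return scale * (rot.T @ codebook[idx].reshape(-1))

# Usage: quantize one 128-dim KV vector with 4-dim blocks.
cb = product_codebook(k=4, d=128, n_shells=4, n_dirs=64)
rot = haar_rotation(128)
v = np.random.default_rng(1).standard_normal(128)
scale, idx = encode(v, rot, cb, 4)
v_hat = decode(scale, idx, rot, cb)
```

At 4 shells and 64 directions the codebook has 256 entries, i.e. 8 bits per 4-dimensional block, or 2 bits per coordinate plus the stored norm.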