FibQuant: New Vector Quantizer for KV-Cache Compression in AI
FibQuant is a technique for easing the memory bottleneck of long-context AI inference by compressing the key-value (KV) cache. The cache grows with context length, batch size, layer count, and head count, and is read at every decoding step, so memory traffic becomes a dominant constraint. Existing rotation-based scalar codecs store the norm, apply a shared random rotation, and quantize coordinates one at a time, but they discard the geometric structure that normalization creates. After a Haar rotation, a block of k consecutive coordinates follows a spherical-Beta distribution on the unit ball. FibQuant is a universal fixed-rate vector quantizer that keeps the normalize-rotate-store interface while replacing scalar tables with a shared radial-angular codebook matched to this canonical source. The codebook combines Beta-quantile radii, Fibonacci/Roberts–Kronecker quasi-uniform directions, and multi-restart Lloyd optimization. The paper is available on arXiv under ID 2605.11478.
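To make the codebook construction concrete, here is a minimal Python sketch of how Beta-quantile radii could be crossed with Roberts–Kronecker quasi-uniform directions to form a radial-angular product codebook. Everything beyond the summary above is an assumption: the Beta(k/2, (d - k)/2) law for a k-block's squared norm (the standard distribution for a k-coordinate block of a uniform unit vector in R^d), the Gaussian inverse-CDF mapping onto the sphere, and all function names are ours, not the paper's.

```python
import numpy as np
from scipy.stats import beta, norm

def beta_quantile_radii(k: int, d: int, n_shells: int) -> np.ndarray:
    # Mid-point quantiles of Beta(k/2, (d-k)/2): for a unit vector in R^d,
    # the squared norm of a k-coordinate block follows this law, so these
    # radii spread the shells evenly in probability mass (assumed model).
    probs = (np.arange(n_shells) + 0.5) / n_shells
    return np.sqrt(beta.ppf(probs, k / 2.0, (d - k) / 2.0))

def kronecker_directions(k: int, n_dirs: int) -> np.ndarray:
    # Roberts' generalized-golden-ratio Kronecker sequence in [0,1)^k,
    # pushed through the Gaussian inverse CDF and normalized: one common
    # way to get quasi-uniform points on S^{k-1} (the paper may differ).
    phi = 1.5
    for _ in range(64):  # fixed point of x = (1 + x)^(1/(k+1))
        phi = (1.0 + phi) ** (1.0 / (k + 1))
    alpha = (1.0 / phi) ** np.arange(1, k + 1)
    u = (np.outer(np.arange(1, n_dirs + 1), alpha) + 0.5) % 1.0
    g = norm.ppf(np.clip(u, 1e-9, 1 - 1e-9))  # avoid infinite tails
    return g / np.linalg.norm(g, axis=1, keepdims=True)

def product_codebook(k: int, d: int, n_shells: int, n_dirs: int) -> np.ndarray:
    # Radial-angular product: every shell radius paired with every direction.
    radii = beta_quantile_radii(k, d, n_shells)
    dirs = kronecker_directions(k, n_dirs)
    return (radii[:, None, None] * dirs[None, :, :]).reshape(-1, k)
```

A natural reading of the summary is that this product grid is only an initialization, which multi-restart Lloyd optimization then refines against the spherical-Beta source.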
Key facts
- FibQuant is a universal fixed-rate vector quantizer for KV-cache compression (see the end-to-end sketch after this list).
- It addresses memory traffic problems in long-context AI inference.
- The KV cache grows with context length, batch size, layers, and heads.
- Existing rotation-based scalar codecs discard the geometric structure introduced by normalization.
- After a Haar rotation, a block of k coordinates is a spherical-Beta source.
- FibQuant uses a shared radial-angular codebook matched to this source.
- The codebook combines Beta-quantile radii, Fibonacci/Roberts–Kronecker quasi-uniform directions, and multi-restart Lloyd optimization.
- The paper is on arXiv with ID 2605.11478.
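To show how the facts above fit together, the sketch below wires a codebook into the normalize-rotate-store interface: sample a shared Haar rotation, store each vector's norm, and map every k-block to its nearest codeword. It assumes the `product_codebook` helper from the earlier sketch is in scope; the QR-based Haar sampler, the brute-force search, and all names are illustrative, not the paper's implementation.

```python
import numpy as np

def haar_rotation(d: int, seed: int = 0) -> np.ndarray:
    # Shared random rotation: QR of a Gaussian matrix, columns sign-fixed
    # so the result is Haar-distributed on the orthogonal group.
    rng = np.random.default_rng(seed)
    q, r = np.linalg.qr(rng.standard_normal((d, d)))
    return q * np.sign(np.diag(r))

def encode(vec, rot, codebook, k):
    # Store the norm, rotate to the canonical spherical source, then map
    # each k-block to its nearest codeword (brute force, for clarity).
    scale = np.linalg.norm(vec)
    x = (rot @ vec) / scale
    blocks = x.reshape(-1, k)  # assumes k divides d
    d2 = ((blocks[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    return scale, d2.argmin(axis=1)

def decode(scale, idx, rot, codebook):
    # Look up codewords, undo the rotation, and restore the stored norm.
    return scale * (rot.T @ codebook[idx].reshape(-1))

# Usage: quantize one 128-dim KV vector with 4-dim blocks.
cb = product_codebook(k=4, d=128, n_shells=4, n_dirs=64)
rot = haar_rotation(128)
v = np.random.default_rng(1).standard_normal(128)
scale, idx = encode(v, rot, cb, 4)
v_hat = decode(scale, idx, rot, cb)
```

At 4 shells and 64 directions the codebook has 256 entries, i.e. 8 bits per 4-dimensional block, or 2 bits per coordinate plus the stored norm.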