OCTOPUS: Optimized KV Cache Compression via Octahedral Parametrization
A novel technique named OCTOPUS enhances the compression of key-value (KV) caches for transformers during long-context autoregressive inference. This approach builds upon earlier rotation-preconditioned codecs like TurboQuant and PolarQuant by jointly quantizing rotated coordinate triplets. The direction of each triplet is represented as a square through octahedral parameterization, followed by Lloyd-Max quantization of two coordinates along with the triplet norm. This process allows for non-uniform bit allocation that relies solely on key dimensionality, achieving optimal squared error. The codec operates in a data-oblivious, online, and deterministic manner. The findings are available in a paper on arXiv (2605.21226).
Key facts
- OCTOPUS optimizes KV cache compression for transformers.
- It uses octahedral parameterization to map triplet directions to a square.
- Lloyd-Max quantization is applied to two coordinates and the triplet norm.
- Bit allocation is non-uniform and depends only on key dimensionality.
- The codec is data-oblivious, online, and deterministic.
- It builds on TurboQuant and PolarQuant methods.
- The paper is on arXiv with ID 2605.21226.
- It targets long-context autoregressive inference.
Entities
Institutions
- arXiv