OCTOPUS: Optimized KV Cache Compression via Octahedral Parametrization

other · 2026-05-22

A new technique, OCTOPUS, improves compression of the key-value (KV) cache in transformers during long-context autoregressive inference. It builds on earlier rotation-preconditioned codecs such as TurboQuant and PolarQuant by jointly quantizing rotated coordinate triplets. The direction of each triplet is mapped to a point in a square through an octahedral parametrization. Lloyd-Max quantization is then applied to the two square coordinates and to the triplet norm. This yields a non-uniform bit allocation that depends only on the key dimensionality and is optimal in squared error. The codec is data-oblivious, online, and deterministic. The paper is available on arXiv (2605.21226).
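The octahedral map is a standard way to encode a 3D direction as two numbers in the square [-1, 1]^2. It projects the direction onto the L1 unit octahedron, then folds the lower hemisphere outward into the square's corners. The minimal Python sketch below applies that map to a single triplet and keeps the triplet norm alongside it. It assumes the common graphics-style fold convention. The summary does not say which convention OCTOPUS uses, and the names oct_encode, oct_decode and _sign are hypothetical. No quantization is applied here, so the round trip is exact up to floating-point error.

```python
import math


def _sign(a: float) -> float:
    """Sign with sign(0) = +1, so the fold never zeroes out a coordinate."""
    return 1.0 if a >= 0.0 else -1.0


def oct_encode(x: float, y: float, z: float) -> tuple[float, float, float]:
    """Map a triplet to (norm, u, v), with (u, v) in the square [-1, 1]^2.

    Hypothetical helper; assumes the usual graphics-style octahedral fold.
    """
    r = math.sqrt(x * x + y * y + z * z)
    if r == 0.0:
        return 0.0, 0.0, 0.0  # direction undefined; decodes back to the zero vector
    # Project the direction onto the L1 unit octahedron |x| + |y| + |z| = 1.
    s = abs(x) + abs(y) + abs(z)
    px, py, pz = x / s, y / s, z / s
    if pz < 0.0:
        # Fold the lower hemisphere outward into the corners of the square.
        px, py = (1.0 - abs(py)) * _sign(px), (1.0 - abs(px)) * _sign(py)
    return r, px, py


def oct_decode(r: float, u: float, v: float) -> tuple[float, float, float]:
    """Invert oct_encode: unfold (u, v) to a unit direction, then rescale by r."""
    z = 1.0 - abs(u) - abs(v)
    x, y = u, v
    if z < 0.0:
        # Points outside the inner diamond came from the lower hemisphere.
        x, y = (1.0 - abs(v)) * _sign(u), (1.0 - abs(u)) * _sign(v)
    n = math.sqrt(x * x + y * y + z * z)  # > 0 because |x| + |y| + |z| = 1 here
    return r * x / n, r * y / n, r * z / n


r, u, v = oct_encode(-0.5, 0.25, -2.0)
print(oct_decode(r, u, v))  # approximately (-0.5, 0.25, -2.0)
```

In a quantized codec, the loss would come only from rounding r, u and v to codebook values. The map itself is lossless, and it spreads directions over the square without the pole singularities of spherical angles.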

Key facts

OCTOPUS optimizes KV cache compression for transformers.
It uses an octahedral parametrization to map each triplet's direction to a point in a square.
Lloyd-Max quantization is applied to the two square coordinates and the triplet norm.
Bit allocation is non-uniform and depends only on key dimensionality.
The codec is data-oblivious, online, and deterministic.
It builds on TurboQuant and PolarQuant methods.
The paper is on arXiv with ID 2605.21226.
It targets long-context autoregressive inference.
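Lloyd-Max quantization designs a scalar codebook that minimizes squared error by alternating two conditions. First, decision thresholds sit halfway between reconstruction levels. Second, each level is the mean of the density over its cell. The minimal Python sketch below designs such a codebook offline for a fixed, known density. This is one way a codec can stay data-oblivious, because the codebook never looks at actual KV data. The names lloyd_max, quantize, dequantize and std_normal_pdf are hypothetical. The standard normal density is an assumed stand-in: the summary does not give the actual distributions of the square coordinates or the triplet norm, nor the bit widths OCTOPUS assigns to each.

```python
import bisect
import math
from typing import Callable


def lloyd_max(pdf: Callable[[float], float], lo: float, hi: float, bits: int,
              grid: int = 4000, iters: int = 200) -> tuple[list[float], list[float]]:
    """Design a 2**bits-level squared-error-optimal scalar quantizer for pdf on [lo, hi].

    The density is integrated on a midpoint grid and need not be normalized.
    Returns (levels, thresholds), both sorted ascending.
    """
    k = 1 << bits
    step = (hi - lo) / grid
    xs = [lo + (i + 0.5) * step for i in range(grid)]
    ws = [pdf(x) for x in xs]
    # Start from a uniform quantizer on [lo, hi].
    levels = [lo + (j + 0.5) * (hi - lo) / k for j in range(k)]
    for _ in range(iters):
        # Nearest-neighbour condition: thresholds are midpoints between levels.
        thresholds = [(a + b) / 2 for a, b in zip(levels, levels[1:])]
        num = [0.0] * k
        den = [0.0] * k
        for x, w in zip(xs, ws):
            j = bisect.bisect_right(thresholds, x)
            num[j] += w * x
            den[j] += w
        # Centroid condition: each level moves to the conditional mean of its cell.
        levels = [num[j] / den[j] if den[j] > 0.0 else levels[j] for j in range(k)]
    thresholds = [(a + b) / 2 for a, b in zip(levels, levels[1:])]
    return levels, thresholds


def quantize(x: float, thresholds: list[float]) -> int:
    """Return the index of the cell containing x (the value that would be stored)."""
    return bisect.bisect_right(thresholds, x)


def dequantize(idx: int, levels: list[float]) -> float:
    """Reconstruct a value from its stored index."""
    return levels[idx]


def std_normal_pdf(x: float) -> float:
    # Assumed stand-in density, not the distribution OCTOPUS actually targets.
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)


levels, thresholds = lloyd_max(std_normal_pdf, -6.0, 6.0, bits=2)
print(levels)  # close to the classic 4-level Gaussian quantizer: about ±0.4528, ±1.510
```

Under this sketch, each of the three per-triplet quantities would get its own codebook and bit width. Because the codebooks depend only on the assumed densities, they can be computed once ahead of time. Encoding is then a deterministic threshold lookup, which fits the online, data-oblivious setting the summary describes.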
