Tensor Cache: A Two-Level Memory for Transformer KV Cache

publication · 2026-05-25

Tensor Cache improves the efficiency of Transformers by pairing sliding-window softmax attention (L1) with a constant-size outer-product fast-weight memory (L2) fed by evicted KV pairs. The most recent tokens keep exact local attention, while pairs evicted from the window are compressed into a per-layer matrix A, which is read with a single matrix multiplication via the linear-attention identity. A learned scalar gate fuses the L1 and L2 outputs, and per-head decay and write-rate parameters are trained end-to-end. The method is described in arXiv:2605.22884.
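
For readers unfamiliar with the linear-attention identity, it is the standard observation that, once the softmax is dropped, a sum of attention-weighted values can be regrouped so that keys and values are summed once into a fixed-size matrix. The notation below (q for the query, k_i and v_i for evicted keys and values, E for the set of evicted positions) is an assumption made for illustration; the paper's exact form, including how decay and write rate enter, may differ.

```latex
\sum_{i \in E} (k_i^\top q)\, v_i
  = \Big( \sum_{i \in E} v_i k_i^\top \Big) q
  = A q,
\qquad
A = \sum_{i \in E} v_i k_i^\top \in \mathbb{R}^{d_v \times d_k}.
```

The size of A depends only on the key and value dimensions, not on how many pairs have been evicted, which is why the L2 memory stays constant-size.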

Key facts

Tensor Cache is a two-level cache for autoregressive Transformers.
L1 uses sliding-window softmax attention.
L2 uses outer-product fast-weight memory fed by evicted KV pairs.
Evicted pairs are compressed into a per-layer matrix A.
Reading uses a single matrix multiplication via the linear-attention identity.
A learned scalar gate fuses L1 and L2 outputs.
Per-head decay and write-rate parameters are trained end-to-end.
Paper available on arXiv: 2605.22884.
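
The facts above can be illustrated with a minimal single-head sketch in NumPy. This is not the paper's implementation: the class name `TwoLevelCache`, the exact update rule `A = decay * A + write_rate * outer(v, k)`, and the convex-combination gate are all assumptions chosen for illustration, and the decay, write rate, and gate are fixed constants here rather than learned parameters.

```python
import numpy as np
from collections import deque


def softmax(x):
    # Numerically stable softmax over a 1-D array.
    e = np.exp(x - np.max(x))
    return e / e.sum()


class TwoLevelCache:
    """Hypothetical single-head sketch: exact window (L1) plus fast-weight matrix (L2)."""

    def __init__(self, d_k, d_v, window, decay=0.99, write_rate=1.0, gate=0.5):
        self.window = window
        self.keys = deque()               # L1: most recent keys
        self.values = deque()             # L1: most recent values
        self.A = np.zeros((d_v, d_k))     # L2: constant-size memory
        self.decay = decay                # assumed fixed; learned per head in the paper
        self.write_rate = write_rate      # assumed fixed; learned per head in the paper
        self.gate = gate                  # assumed fixed scalar; learned in the paper
        self.scale = 1.0 / np.sqrt(d_k)

    def append(self, k, v):
        # Add the new pair to the window; fold the oldest pair into A on overflow.
        self.keys.append(k)
        self.values.append(v)
        if len(self.keys) > self.window:
            k_old = self.keys.popleft()
            v_old = self.values.popleft()
            # Assumed update rule: decay old memory, then write an outer product.
            self.A = self.decay * self.A + self.write_rate * np.outer(v_old, k_old)

    def read(self, q):
        # Requires at least one appended token.
        K = np.stack(self.keys)
        V = np.stack(self.values)
        local = softmax((K @ q) * self.scale) @ V   # L1: softmax attention over the window
        far = self.A @ q                            # L2: one matrix multiplication
        # Assumed fusion: convex combination controlled by the scalar gate.
        return self.gate * local + (1.0 - self.gate) * far
```

With `decay=1.0` and `write_rate=1.0`, `A @ q` equals the unnormalized linear-attention sum over all evicted pairs, which makes the sketch easy to check against a direct loop.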
