ARTFEED — Contemporary Art Intelligence

Tensor Cache: A Two-Level Memory for Transformer KV Cache

publication · 2026-05-25

Tensor Cache is a technique for improving the efficiency of Transformers by pairing sliding-window softmax attention (L1) with a constant-size outer-product fast-weight memory (L2) fed by evicted KV pairs. The most recent tokens keep exact local attention, while pairs evicted from the window are compressed into a per-layer matrix A, which is read with a single matrix multiplication via the linear-attention identity. A learned scalar gate combines the two outputs, and per-head decay and write-rate parameters are trained end-to-end. The method is described in arXiv paper 2605.22884.
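
The linear-attention identity mentioned above can be written out as follows. This is a sketch of the standard unnormalized form, not the paper's exact formulation: the symbols E (the set of evicted positions), \lambda (decay), \beta (write rate), g (gate) and the precise shape of the update and fusion rules are assumptions made for illustration.

```latex
% Reading all evicted pairs at once: a sum of dot-product-weighted values
% collapses into one matrix-vector product with a fixed-size matrix A.
\sum_{i \in E} \left( k_i^{\top} q \right) v_i
  \;=\; \Bigl( \sum_{i \in E} v_i k_i^{\top} \Bigr) q
  \;=\; A\, q

% Assumed write rule when a pair (k_e, v_e) leaves the window:
A \;\leftarrow\; \lambda\, A \;+\; \beta\, v_e k_e^{\top}

% Assumed fusion of the two levels with a scalar gate g \in [0, 1]:
y \;=\; g\, o_{\mathrm{L1}} \;+\; (1 - g)\, A\, q
```

Because A has a fixed size regardless of how many pairs have been evicted, the cost of the L2 read does not grow with sequence length.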

Key facts

  • Tensor Cache is a two-level cache for autoregressive Transformers.
  • L1 uses sliding-window softmax attention.
  • L2 uses outer-product fast-weight memory fed by evicted KV pairs.
  • Evicted pairs are compressed into a per-layer matrix A.
  • Reading uses a single matrix multiplication via the linear-attention identity.
  • A learned scalar gate fuses L1 and L2 outputs.
  • Per-head decay and write-rate parameters are trained end-to-end.
  • Paper available on arXiv: 2605.22884.
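
To make the key facts concrete, here is a minimal single-head NumPy sketch of a two-level cache of this kind. It is an illustration under stated assumptions, not the paper's implementation: the class name TwoLevelCache, the fixed scalar values for decay, write_rate and gate (which the paper learns end-to-end), and the exact update and fusion formulas are all hypothetical.

```python
import numpy as np
from collections import deque


def softmax(x):
    # Numerically stable softmax over a 1-D array.
    e = np.exp(x - np.max(x))
    return e / e.sum()


class TwoLevelCache:
    """Hypothetical single-head sketch: exact window (L1) plus outer-product memory (L2)."""

    def __init__(self, d, window, decay=0.99, write_rate=1.0, gate=0.5):
        self.d = d
        self.window = window
        self.decay = decay            # assumed per-head decay (learned in the paper)
        self.write_rate = write_rate  # assumed per-head write rate (learned in the paper)
        self.gate = gate              # assumed scalar gate in [0, 1] (learned in the paper)
        self.recent = deque()         # L1: the most recent (key, value) pairs
        self.A = np.zeros((d, d))     # L2: constant-size fast-weight matrix

    def append(self, k, v):
        # New pairs enter L1; the oldest pair is evicted once the window is full.
        self.recent.append((k, v))
        if len(self.recent) > self.window:
            k_old, v_old = self.recent.popleft()
            # Fold the evicted pair into A as a decayed outer-product write.
            self.A = self.decay * self.A + self.write_rate * np.outer(v_old, k_old)

    def read(self, q):
        # L1: exact softmax attention over the sliding window.
        keys = np.stack([k for k, _ in self.recent])
        values = np.stack([v for _, v in self.recent])
        weights = softmax(keys @ q / np.sqrt(self.d))
        local = weights @ values
        # L2: one matrix-vector product reads every evicted pair at once.
        remote = self.A @ q
        # Scalar gate fuses the two levels.
        return self.gate * local + (1.0 - self.gate) * remote


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    cache = TwoLevelCache(d=8, window=4)
    for _ in range(16):
        cache.append(rng.normal(size=8), rng.normal(size=8))
    print(cache.read(rng.normal(size=8)).shape)
```

In this sketch, memory use stays at window pairs plus one d-by-d matrix no matter how many tokens have been appended, which is the property the two-level design is after.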

Entities

Institutions

  • arXiv
