Tensor Cache: A Two-Level Memory for Transformer KV Cache
Tensor Cache is a two-level cache for autoregressive Transformers. It pairs sliding-window softmax attention (L1) with a constant-size outer-product fast-weight memory (L2) that is fed by the KV pairs evicted from the window. The most recent tokens keep exact local attention, while evicted pairs are compressed into a per-layer matrix A that is read with a single matrix multiplication via the linear-attention identity. A learned scalar gate fuses the L1 and L2 outputs, and per-head decay and write-rate parameters are trained end-to-end. The method is described in arXiv:2605.22884.
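The two levels can be illustrated with a small NumPy sketch for a single head. This is not the paper's implementation: the class name `TwoLevelCache`, the update rule `A = decay * A + write_rate * outer(k, v)`, the convex-combination form of the gate, and the use of fixed scalars in place of learned parameters are all assumptions made for illustration.

```python
import numpy as np


def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    e = np.exp(x - np.max(x))
    return e / e.sum()


class TwoLevelCache:
    """Hypothetical single-head sketch: exact window (L1) plus outer-product memory (L2)."""

    def __init__(self, d_k, d_v, window, decay=0.99, write_rate=1.0, gate=0.5):
        self.window = window
        self.decay = decay            # assumed fixed scalar; learned per head in the source
        self.write_rate = write_rate  # assumed fixed scalar; learned per head in the source
        self.gate = gate              # assumed fixed scalar; learned in the source
        self.keys, self.values = [], []
        self.A = np.zeros((d_k, d_v))  # constant-size L2 memory

    def append(self, k, v):
        """Add a KV pair to L1; fold the oldest pair into A once the window is full."""
        self.keys.append(k)
        self.values.append(v)
        if len(self.keys) > self.window:
            k_old, v_old = self.keys.pop(0), self.values.pop(0)
            # Assumed update rule: decay the memory, then write the evicted pair.
            self.A = self.decay * self.A + self.write_rate * np.outer(k_old, v_old)

    def read(self, q):
        """Fuse exact softmax attention over the window with a linear read of A."""
        K, V = np.stack(self.keys), np.stack(self.values)
        l1 = softmax(K @ q / np.sqrt(q.shape[0])) @ V  # L1: sliding-window softmax attention
        l2 = q @ self.A                                # L2: one matrix multiplication
        # Assumed fusion: convex combination controlled by the scalar gate.
        return self.gate * l1 + (1.0 - self.gate) * l2


rng = np.random.default_rng(0)
cache = TwoLevelCache(d_k=4, d_v=3, window=2)
for _ in range(5):
    cache.append(rng.normal(size=4), rng.normal(size=3))
out = cache.read(rng.normal(size=4))  # shape (3,)
```

In this sketch the memory footprint of L2 stays at `d_k * d_v` numbers however many pairs have been evicted, which is the property that makes the second level constant-size.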
Key facts
- Tensor Cache is a two-level cache for autoregressive Transformers.
- L1 uses sliding-window softmax attention.
- L2 uses outer-product fast-weight memory fed by evicted KV pairs.
- Evicted pairs are compressed into a per-layer matrix A.
- Reading L2 takes a single matrix multiplication, via the linear-attention identity.
- A learned scalar gate fuses L1 and L2 outputs.
- Per-head decay and write-rate parameters are trained end-to-end.
- Paper available on arXiv: 2605.22884.
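The linear-attention identity behind the single-multiplication read can be written out as follows. The identity itself is standard; the decayed form in the second line assumes an update rule A_t = λ A_{t-1} + β k_t v_t^T with A_0 = 0, where λ (decay) and β (write rate) are illustrative symbols that may not match the paper's exact parameterization.

```latex
% Unweighted case: n evicted pairs (k_i, v_i), query q
\sum_{i=1}^{n} (q^\top k_i)\, v_i^\top
  = q^\top \Bigl( \sum_{i=1}^{n} k_i v_i^\top \Bigr)
  = q^\top A

% Assumed decayed case: A_t = \lambda A_{t-1} + \beta\, k_t v_t^\top, \; A_0 = 0
q^\top A_n = \sum_{i=1}^{n} \beta\, \lambda^{\,n-i}\, (q^\top k_i)\, v_i^\top
```

Under these assumptions the read costs one product with a d_k × d_v matrix regardless of n, and older evicted pairs contribute with geometrically smaller weight.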
Entities
Institutions
- arXiv