TTKV: Human-Memory-Inspired KV Cache for Long-Context LLMs
A team of researchers has introduced TTKV, a KV cache management framework for large language models inspired by human memory systems. TTKV partitions the KV cache into temporal tiers of heterogeneous capacity and precision. Its tier layout decouples fast, scarce HBM from slower, more plentiful DRAM; its content policy assigns recent KV states to the faster, higher-precision tiers based on recency; and it manages how KV states move between tiers. The goal is to curb the KV cache memory footprint, which scales linearly with context length and is a central bottleneck in long-context LLM inference. The work is available as arXiv preprint 2604.19769.
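To make the tiering idea concrete, here is a minimal sketch of how a temporally tiered KV cache might work, assuming a simple recency-based demotion policy. The `Tier` and `TieredKVCache` names, the capacities, and the dtypes are hypothetical illustrations, not TTKV's actual API; in particular, a real slow tier would use genuine low-bit quantization rather than a plain dtype cast.

```python
# Hypothetical sketch of a temporally tiered KV cache, not TTKV's real code.
from collections import OrderedDict
from dataclasses import dataclass, field

import numpy as np


@dataclass
class Tier:
    """One temporal tier: a capacity bound plus a storage precision."""
    name: str          # e.g. "hbm" (fast) or "dram" (slow)
    capacity: int      # max number of token KV entries held
    dtype: np.dtype    # precision used when storing in this tier
    entries: "OrderedDict[int, np.ndarray]" = field(default_factory=OrderedDict)


class TieredKVCache:
    """Recency-ordered tiers: new KV states enter the fastest,
    highest-precision tier; overflow cascades into slower tiers,
    mimicking short-term versus long-term memory."""

    def __init__(self, tiers):
        self.tiers = tiers  # ordered fastest/most precise -> slowest

    def append(self, position, kv_state):
        self._insert(0, position, kv_state)

    def _insert(self, tier_idx, position, kv_state):
        tier = self.tiers[tier_idx]
        tier.entries[position] = kv_state.astype(tier.dtype)
        if len(tier.entries) > tier.capacity:
            # Demote the oldest entry to the next (slower) tier,
            # re-encoding it at that tier's precision.
            oldest_pos, oldest_kv = tier.entries.popitem(last=False)
            if tier_idx + 1 < len(self.tiers):
                self._insert(tier_idx + 1, oldest_pos, oldest_kv)
            # else: the oldest state falls out of the cache entirely

    def get(self, position):
        for tier in self.tiers:
            if position in tier.entries:
                return tier.entries[position]
        return None


# Recent tokens stay in a small, fast "HBM" tier; older ones spill into a
# larger "DRAM" tier (a real system would quantize the slow tier to fewer bits).
cache = TieredKVCache([
    Tier("hbm", capacity=4, dtype=np.dtype(np.float16)),
    Tier("dram", capacity=16, dtype=np.dtype(np.float16)),
])
for pos in range(8):
    cache.append(pos, np.random.randn(2, 64))  # toy (K, V) block per token
print(sorted(cache.tiers[0].entries))  # [4, 5, 6, 7]: most recent tokens
print(sorted(cache.tiers[1].entries))  # [0, 1, 2, 3]: demoted older tokens
```

The cascade mirrors the paper's human-memory framing: recent context behaves like fast, precise short-term memory, while older context degrades gracefully into cheaper storage rather than being dropped outright.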
Key facts
- TTKV is a KV cache management framework for LLMs.
- It is inspired by human memory systems.
- The KV cache is partitioned into temporal tiers.
- Tiers have heterogeneous capacity and precision.
- Tier layout decouples fast HBM and slow DRAM.
- Recent KV states are assigned to faster, higher-precision tiers.
- The approach addresses memory footprint scaling with context length.
- The paper is on arXiv with ID 2604.19769.