TokenButler Predicts Critical Tokens in LLM KV-Cache
Researchers have introduced TokenButler, a query-aware predictor that identifies critical tokens in the Key-Value (KV) Cache of Large Language Models (LLMs). The KV-Cache stores token history so that decoding does not recompute it, but it becomes a memory and computation bottleneck as the cached history grows. Prior work shows that only a small subset of tokens contributes meaningfully to each decoding step, and that this subset is dynamic and heavily dependent on the input query. Existing methods either evict tokens permanently, which risks degrading quality, or retain the full cache and apply retrieval-based sparsity guided by inaccurate proxies of token importance. TokenButler learns to predict low-dimensional importance queries at a fixed depth stride, enabling high-granularity, query-aware token selection. The paper is available on arXiv under ID 2503.07518.
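To make "query-aware token selection" concrete, here is a minimal sketch in plain Python. It is not TokenButler's predictor: it ranks cached tokens by their exact query-key dot product, which stands in for the importance score a learned predictor would approximate more cheaply. The function names (`select_critical_tokens`, `sparse_attention`), the `budget` parameter, and the toy vectors are all hypothetical and chosen only for illustration.

```python
import math


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def select_critical_tokens(query, keys, budget):
    """Return indices of the `budget` cached tokens scoring highest for this query.

    Assumption: exact query-key scores are used as the importance signal.
    A learned predictor would replace this scoring step.
    """
    scores = [dot(query, k) for k in keys]
    ranked = sorted(range(len(keys)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:budget])


def sparse_attention(query, keys, values, budget):
    """Attend only over the selected tokens; the full cache is kept, not evicted."""
    idx = select_critical_tokens(query, keys, budget)
    scale = 1.0 / math.sqrt(len(query))
    weights = softmax([dot(query, keys[i]) * scale for i in idx])
    dim = len(values[0])
    out = [sum(w * values[i][d] for w, i in zip(weights, idx)) for d in range(dim)]
    return idx, out


# Toy cache of four tokens with 2-dimensional keys and 1-dimensional values.
keys = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1], [-1.0, 0.0]]
values = [[1.0], [2.0], [3.0], [4.0]]

idx_a, out_a = sparse_attention([1.0, 0.0], keys, values, budget=2)
idx_b, out_b = sparse_attention([0.0, 1.0], keys, values, budget=2)
print(idx_a)  # [0, 2]
print(idx_b)  # [1, 2]
```

The two queries select different subsets of the same cache. A permanent eviction policy tuned to the first query would have discarded token 1, which the second query needs most, which is the risk the paragraph above describes.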
Key facts
- TokenButler is a query-aware predictor for critical tokens in LLM KV-Cache.
- KV-Cache stores token history for efficient decoding but becomes a bottleneck.
- Only a small subset of tokens contribute meaningfully to each decoding step.
- Critical tokens are dynamic and depend heavily on the input query.
- Existing methods either evict tokens permanently or use inaccurate proxies.
- TokenButler predicts low-dimensional importance queries at a fixed depth stride.
- The paper is on arXiv with ID 2503.07518.
- TokenButler offers high-granularity, query-aware token selection.
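The bottleneck mentioned in the key facts can be illustrated with simple arithmetic: a standard KV-Cache holds one key and one value vector per token, per layer, per KV head, so its size grows linearly with sequence length. The sketch below assumes a hypothetical model configuration (32 layers, 32 KV heads, head dimension 128, 16-bit values); these numbers are not taken from the paper.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len,
                   bytes_per_value=2, batch_size=1):
    """Size of a dense KV-Cache in bytes.

    The leading factor of 2 accounts for storing both keys and values.
    """
    return (2 * num_layers * num_kv_heads * head_dim
            * seq_len * bytes_per_value * batch_size)


# Hypothetical configuration, for illustration only.
for seq_len in (1024, 8192, 32768):
    gib = kv_cache_bytes(32, 32, 128, seq_len) / 2**30
    print(f"{seq_len:>6} tokens -> {gib:.1f} GiB")
# Prints:
#   1024 tokens -> 0.5 GiB
#   8192 tokens -> 4.0 GiB
#  32768 tokens -> 16.0 GiB
```

Under these assumptions each cached token costs 0.5 MiB, which is why attending to only a small, well-chosen subset per decoding step is attractive.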
Entities
Institutions
- arXiv