TokenButler Predicts Critical Tokens in LLM KV-Cache

ai-technology · 2026-05-18

Researchers have introduced TokenButler, a query-aware predictor that identifies critical tokens in the Key-Value (KV) Cache of Large Language Models (LLMs). The KV-Cache stores token history so that decoding stays efficient, but it grows with sequence length into a memory and computation bottleneck. Prior work shows that only a small subset of tokens contributes meaningfully to each decoding step, and that this subset is dynamic and input-dependent. Existing methods either evict tokens permanently, risking degraded output quality, or retain the full cache and apply retrieval-based sparsity that relies on inaccurate proxies for token importance. TokenButler instead learns to predict low-dimensional importance queries at a fixed depth stride, enabling high-granularity, query-aware token selection. The paper is available on arXiv under ID 2503.07518.
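To make the idea of query-aware token selection concrete, here is a minimal sketch in Python with NumPy. It is not TokenButler's method: the exact attention logits stand in for the learned importance predictor, which is not reproduced here, and the function names `select_critical_tokens` and `sparse_attention` are hypothetical. The full cache is kept, and each query picks its own top-k tokens from it.

```python
import numpy as np

def select_critical_tokens(query, keys, k):
    """Return the sorted indices of the k cached tokens that score highest for this query.

    Assumption: exact scaled dot-product logits are used as the importance
    score. A learned predictor would replace this line with a cheaper estimate.
    """
    scores = keys @ query / np.sqrt(query.shape[0])
    top = np.argpartition(scores, -k)[-k:]  # indices of the k largest scores, unordered
    return np.sort(top)

def sparse_attention(query, keys, values, k):
    """Attend only over the k selected tokens; nothing is evicted from the cache."""
    idx = select_critical_tokens(query, keys, k)
    logits = keys[idx] @ query / np.sqrt(query.shape[0])
    weights = np.exp(logits - logits.max())  # numerically stable softmax
    weights /= weights.sum()
    return weights @ values[idx], idx

# Toy cache: 4 tokens with 4-dimensional keys and 2-dimensional values.
keys = np.eye(4)
values = np.arange(8.0).reshape(4, 2)

query_a = np.array([0.0, 3.0, 0.0, 1.0])
query_b = np.array([2.0, 0.0, 1.0, 0.0])
_, idx_a = sparse_attention(query_a, keys, values, k=2)
_, idx_b = sparse_attention(query_b, keys, values, k=2)
print(idx_a, idx_b)
```

In this toy cache the first query selects tokens 1 and 3 while the second selects tokens 0 and 2. A method that had permanently evicted tokens 0 and 2 after the first step would have nothing useful left for the second, which is the risk the article attributes to eviction-based approaches.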

Key facts

TokenButler is a query-aware predictor for critical tokens in LLM KV-Cache.
The KV-Cache stores token history for efficient decoding but becomes a memory and computation bottleneck.
Only a small subset of tokens contributes meaningfully to each decoding step.
Critical tokens are dynamic and depend heavily on the input query.
Existing methods either evict tokens permanently or use inaccurate proxies.
TokenButler predicts low-dimensional importance queries at a fixed depth stride.
The paper is on arXiv with ID 2503.07518.
TokenButler offers high-granularity, query-aware token selection.
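
As a rough illustration of why the KV-Cache becomes a bottleneck, the sketch below estimates its size from model dimensions. The function name `kv_cache_bytes` and the example configuration (32 layers, 32 key-value heads, head dimension 128, 16-bit values) are illustrative assumptions, not figures from the paper.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len,
                   batch_size=1, bytes_per_value=2):
    """Estimate KV-Cache size in bytes for a decoder-only transformer."""
    # Factor of 2: each layer stores one key tensor and one value tensor.
    return (2 * num_layers * num_kv_heads * head_dim
            * seq_len * batch_size * bytes_per_value)

# Hypothetical configuration, chosen only to make the arithmetic concrete.
size = kv_cache_bytes(num_layers=32, num_kv_heads=32, head_dim=128, seq_len=4096)
print(size / 2**30)  # 2.0 (GiB)
```

Under these assumptions a single 4,096-token sequence needs 2 GiB of cache, and the figure scales linearly with sequence length and batch size.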
