LaProx: Output-Aware KV Cache Eviction for Long-Context LLMs
LaProx reframes KV cache eviction for long-context LLM inference as an output-aware, layer-wise matrix multiplication approximation problem. Existing techniques rely on local attention weights alone, overlooking value representations, the output projection, and inter-head interactions. LaProx explicitly models the multiplicative interaction between attention maps and projected value states to assess each token's contribution to the layer output, accounting for inter-head dependencies in the process. On top of this formulation, it introduces a unified eviction strategy that assigns globally comparable importance scores, enabling model-wide token selection rather than localized, head-specific choices. The approach reduces both memory usage and runtime overhead during long-context inference.
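To make the matrix-approximation view concrete: per head h, the attention output is O_h = A_h · V_h · W_O^(h), so evicting a key token deletes one column of A_h and the matching row of V_h, and the token's importance is the size of the rank-one term it contributes to the output. The sketch below scores tokens by attention mass times the norm of the projected value state; it is a minimal illustration under assumed tensor shapes (the function name `output_aware_scores` and the per-head split of W_O are ours, not the paper's), not LaProx's exact scoring rule.

```python
import numpy as np

def output_aware_scores(attn, values, w_o):
    """Score cached key tokens by their contribution to the attention output.

    Hypothetical shapes:
      attn:   (H, Q, K) attention weights per head
      values: (H, K, D) value states per head
      w_o:    (H, D, M) per-head slice of the output projection

    Token j in head h contributes the rank-one term
    attn[h, :, j] (outer) (values[h, j] @ w_o[h]) to the output, so we
    score it by its total attention mass times the norm of its
    *projected* value state, rather than by attention weight alone.
    """
    projected = np.einsum("hkd,hdm->hkm", values, w_o)  # (H, K, M)
    v_norm = np.linalg.norm(projected, axis=-1)         # (H, K)
    attn_mass = attn.sum(axis=1)                        # (H, K)
    return attn_mass * v_norm                           # (H, K)

# Toy usage: 8 heads, 4 queries, 128 cached keys.
rng = np.random.default_rng(0)
attn = rng.random((8, 4, 128))
attn /= attn.sum(axis=-1, keepdims=True)  # row-normalize like softmax
scores = output_aware_scores(attn,
                             rng.standard_normal((8, 128, 64)),
                             rng.standard_normal((8, 64, 512)))
print(scores.shape)  # (8, 128)
```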
Key facts
- arXiv:2605.07234v1
- Reformulates KV cache eviction as output-aware, layer-wise matrix multiplication approximation
- Existing methods neglect value representations, output projection, and inter-head interactions
- LaProx models multiplicative interaction between attention maps and projected value states
- First unified eviction strategy with globally comparable importance scores
- Enables model-wide token selection instead of local, head-wise decisions (see the sketch after this list)
- Reduces memory and runtime overhead for long-context LLM inference
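Because the scores are designed to be comparable across heads and layers, eviction reduces to a single global top-k selection under a total memory budget. A minimal sketch, assuming per-layer score arrays already normalized onto a shared scale (the normalization itself is the LaProx-specific part and is not reproduced here; `global_evict` is a hypothetical name):

```python
import numpy as np

def global_evict(scores_per_layer, keep_budget):
    """Keep the `keep_budget` highest-scoring cached tokens model-wide.

    scores_per_layer: list of (H, K) arrays, one per layer, assumed
    already on a globally comparable scale. Returns one boolean
    keep-mask per layer.
    """
    flat = np.concatenate([s.ravel() for s in scores_per_layer])
    # Score of the keep_budget-th largest entry across the whole model.
    threshold = np.partition(flat, -keep_budget)[-keep_budget]
    # Ties at the threshold may keep slightly more than the budget.
    return [s >= threshold for s in scores_per_layer]
```

Under this scheme, heads and layers with many high-impact tokens naturally retain more cache, which a fixed per-head top-k budget cannot do.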