LaProx: Output-Aware KV Cache Eviction for Long-Context LLMs
LaProx reframes KV cache eviction for long-context LLM inference as an output-aware, layer-wise matrix multiplication approximation problem. Existing techniques rely on local attention weights alone, overlooking value representations, the output projection, and inter-head interactions. LaProx explicitly models the multiplicative interaction between attention maps and projected value states to assess each token's contribution to the layer output, accounting for inter-head dependencies in the process. On top of this formulation, it introduces a unified eviction strategy that assigns globally comparable importance scores, enabling model-wide token selection rather than localized, head-specific choices. The approach reduces both memory usage and runtime overhead during long-context inference.
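To make the matrix-approximation view concrete: per head h, the attention output is O_h = A_h · V_h · W_O^(h), so evicting a key token deletes one column of A_h and the matching row of V_h, and the token's importance is the size of the rank-one term it contributes to the output. The sketch below scores tokens by attention mass times the norm of the projected value state; it is a minimal illustration under assumed tensor shapes (the function name `output_aware_scores` and the per-head split of W_O are ours, not the paper's), not LaProx's exact scoring rule.

```python
import numpy as np

def output_aware_scores(attn, values, w_o):
    """Score cached key tokens by their contribution to the attention output.

    Hypothetical shapes:
      attn:   (H, Q, K) attention weights per head
      values: (H, K, D) value states per head
      w_o:    (H, D, M) per-head slice of the output projection

    Token j in head h contributes the rank-one term
    attn[h, :, j] (outer) (values[h, j] @ w_o[h]) to the output, so we
    score it by its total attention mass times the norm of its
    *projected* value state, rather than by attention weight alone.
    """
    projected = np.einsum("hkd,hdm->hkm", values, w_o)  # (H, K, M)
    v_norm = np.linalg.norm(projected, axis=-1)         # (H, K)
    attn_mass = attn.sum(axis=1)                        # (H, K)
    return attn_mass * v_norm                           # (H, K)

# Toy usage: 8 heads, 4 queries, 128 cached keys.
rng = np.random.default_rng(0)
attn = rng.random((8, 4, 128))
attn /= attn.sum(axis=-1, keepdims=True)  # row-normalize like softmax
scores = output_aware_scores(attn,
                             rng.standard_normal((8, 128, 64)),
                             rng.standard_normal((8, 64, 512)))
print(scores.shape)  # (8, 128)
```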
Key facts
- arXiv:2605.07234v1
- Reformulates KV cache eviction as output-aware, layer-wise matrix multiplication approximation
- Existing methods neglect value representations, output projection, and inter-head interactions
- LaProx models multiplicative interaction between attention maps and projected value states
- First unified eviction strategy with globally comparable importance scores
- Enables model-wide token selection instead of local, head-wise decisions (see the sketch after this list)
- Reduces memory and runtime overhead for long-context LLM inference
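Because the scores are designed to be comparable across heads and layers, eviction reduces to a single global top-k selection under a total memory budget. A minimal sketch, assuming per-layer score arrays already normalized onto a shared scale (the normalization itself is the LaProx-specific part and is not reproduced here; `global_evict` is a hypothetical name):

```python
import numpy as np

def global_evict(scores_per_layer, keep_budget):
    """Keep the `keep_budget` highest-scoring cached tokens model-wide.

    scores_per_layer: list of (H, K) arrays, one per layer, assumed
    already on a globally comparable scale. Returns one boolean
    keep-mask per layer.
    """
    flat = np.concatenate([s.ravel() for s in scores_per_layer])
    # Score of the keep_budget-th largest entry across the whole model.
    threshold = np.partition(flat, -keep_budget)[-keep_budget]
    # Ties at the threshold may keep slightly more than the budget.
    return [s >= threshold for s in scores_per_layer]
```

Under this scheme, heads and layers with many high-impact tokens naturally retain more cache, which a fixed per-head top-k budget cannot do.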