Stochastic KV Routing: Adaptive Cache Sharing for LLMs
A new arXiv paper (2604.22782) proposes Stochastic KV Routing, a method for shrinking the KV cache memory footprint when serving transformer language models. It targets the depth dimension, arguing that keeping a full KV cache at every layer is redundant. By injecting random cross-layer attention during training, the model learns to consume another layer's cached keys and values, so caches can be shared across layers at inference without information loss, while avoiding the throughput and time-to-first-token penalties of prior approaches.
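The mechanism described above can be sketched roughly as follows: during training, each layer sometimes attends to a randomly chosen earlier layer's K/V instead of its own, so caches become interchangeable across depth; at inference, only a subset of layers keep a cache and the rest route to the nearest cached layer below. This is a minimal NumPy illustration, not the paper's implementation; the routing probability `p_route`, the single-head toy attention, and the choice of which layers retain a cache are all assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v):
    # Single-head scaled dot-product attention.
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores) @ v

def forward(x, layers, rng=None, p_route=0.5, cache_layers=None):
    """Toy depth-wise KV routing (illustrative assumption, not the paper's code).

    Training mode (rng given): with probability p_route a layer attends to a
    randomly chosen earlier layer's K/V instead of its own.

    Inference mode (cache_layers given): only the listed layers materialize a
    KV cache; every other layer routes to the nearest cached layer below it,
    so cache memory shrinks proportionally.
    """
    kv_bank = {}  # layer index -> (K, V) actually kept in the "cache"
    h = x
    for i, (wq, wk, wv) in enumerate(layers):
        if cache_layers is None or i in cache_layers:
            # Materialize this layer's K/V (training keeps all of them).
            kv_bank[i] = (h @ wk, h @ wv)
        if rng is not None and i > 0 and rng.random() < p_route:
            # Training: randomly route to an earlier layer's K/V so the
            # model learns that caches are interchangeable across depth.
            src = int(rng.integers(0, i))
        else:
            # Use own cache if kept, else the nearest cached layer below.
            src = max(j for j in kv_bank if j <= i)
        k, v = kv_bank[src]
        h = h + attention(h @ wq, k, v)
    return h

rng = np.random.default_rng(0)
d, n_layers, seq = 8, 4, 5
layers = [tuple(rng.normal(size=(d, d)) / np.sqrt(d) for _ in range(3))
          for _ in range(n_layers)]
x = rng.normal(size=(seq, d))

y_train = forward(x, layers, rng=np.random.default_rng(1))  # stochastic routing
y_infer = forward(x, layers, cache_layers=[0, 2])           # only 2 of 4 caches kept
print(y_train.shape, y_infer.shape)  # (5, 8) (5, 8)
```

In the inference call, layers 1 and 3 never materialize their own K/V, which is where the memory saving would come from in a real cache.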
Key facts
- arXiv paper 2604.22782 proposes Stochastic KV Routing
- Focuses on reducing KV cache memory in transformer LLMs
- Exploits the depth dimension as an optimization axis orthogonal to prior cache-compression approaches
- Random cross-layer attention during training enables cache sharing
- Claims a layer's cache can be dropped without information loss
- Addresses throughput and latency issues of prior methods
- Published on arXiv as a cross-listed submission
- Title: Stochastic KV Routing: Enabling Adaptive Depth-Wise Cache Sharing
Entities
Institutions
- arXiv