Stochastic KV Routing: Adaptive Cache Sharing for LLMs
A new arXiv paper (2604.22782) proposes Stochastic KV Routing, a method for shrinking the KV cache memory footprint when serving transformer language models. It targets the depth dimension, arguing that keeping a full KV cache at every layer is redundant. By injecting random cross-layer attention during training, the model learns to consume another layer's cached keys and values, so caches can be shared across layers at inference without information loss, while avoiding the throughput and time-to-first-token penalties of prior approaches.
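The mechanism described above can be sketched roughly as follows: during training, each layer sometimes attends to a randomly chosen earlier layer's K/V instead of its own, so caches become interchangeable across depth; at inference, only a subset of layers keep a cache and the rest route to the nearest cached layer below. This is a minimal NumPy illustration, not the paper's implementation; the routing probability `p_route`, the single-head toy attention, and the choice of which layers retain a cache are all assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v):
    # Single-head scaled dot-product attention.
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores) @ v

def forward(x, layers, rng=None, p_route=0.5, cache_layers=None):
    """Toy depth-wise KV routing (illustrative assumption, not the paper's code).

    Training mode (rng given): with probability p_route a layer attends to a
    randomly chosen earlier layer's K/V instead of its own.

    Inference mode (cache_layers given): only the listed layers materialize a
    KV cache; every other layer routes to the nearest cached layer below it,
    so cache memory shrinks proportionally.
    """
    kv_bank = {}  # layer index -> (K, V) actually kept in the "cache"
    h = x
    for i, (wq, wk, wv) in enumerate(layers):
        if cache_layers is None or i in cache_layers:
            # Materialize this layer's K/V (training keeps all of them).
            kv_bank[i] = (h @ wk, h @ wv)
        if rng is not None and i > 0 and rng.random() < p_route:
            # Training: randomly route to an earlier layer's K/V so the
            # model learns that caches are interchangeable across depth.
            src = int(rng.integers(0, i))
        else:
            # Use own cache if kept, else the nearest cached layer below.
            src = max(j for j in kv_bank if j <= i)
        k, v = kv_bank[src]
        h = h + attention(h @ wq, k, v)
    return h

rng = np.random.default_rng(0)
d, n_layers, seq = 8, 4, 5
layers = [tuple(rng.normal(size=(d, d)) / np.sqrt(d) for _ in range(3))
          for _ in range(n_layers)]
x = rng.normal(size=(seq, d))

y_train = forward(x, layers, rng=np.random.default_rng(1))  # stochastic routing
y_infer = forward(x, layers, cache_layers=[0, 2])           # only 2 of 4 caches kept
print(y_train.shape, y_infer.shape)  # (5, 8) (5, 8)
```

In the inference call, layers 1 and 3 never materialize their own K/V, which is where the memory saving would come from in a real cache.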
Key facts
- arXiv paper 2604.22782 proposes Stochastic KV Routing
- Focuses on reducing KV cache memory in transformer LLMs
- Exploits the depth dimension as an optimization axis orthogonal to prior cache-compression approaches
- Random cross-layer attention during training enables cache sharing
- Claims a layer's cache can be dropped without information loss
- Addresses throughput and latency issues of prior methods
- Published on arXiv as a cross-listed submission
- Title: Stochastic KV Routing: Enabling Adaptive Depth-Wise Cache Sharing
Entities
Institutions
- arXiv