Sparse Prefix Caching Optimizes Hybrid and Recurrent LLM Serving
A recent arXiv paper (2605.05219) presents sparse prefix caching, a method tailored to serving hybrid and recurrent large language models (LLMs). Unlike conventional attention-based models, which rely on a dense key/value cache with one entry per token, state-space models (SSMs) can resume generation from a single stored recurrent state. This property enables a new caching strategy: store exact recurrent states at sparsely chosen checkpoint positions, and on a cache hit, resume from the deepest stored checkpoint and recompute only the remaining suffix. The authors formalize checkpoint placement as an optimization problem over a distribution of prefix-overlap depths and give an exact O(NM) dynamic-programming solution. In workloads where requests share long prefixes, such as repeated queries against the same long document, the method improves the Pareto frontier over standard heuristics on real data from QuALITY and Sys.
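The cache-hit path described above can be sketched as a small lookup routine. This is a hypothetical illustration, not the paper's API: `resume_point`, its arguments, and the list-of-positions representation are assumptions made for clarity.

```python
from bisect import bisect_right

def resume_point(checkpoints, overlap_depth):
    """Given sorted token positions where exact recurrent states are stored,
    return the deepest checkpoint at or before the shared-prefix depth,
    plus the number of suffix tokens that must be recomputed.
    (Hypothetical sketch; names are not from the paper.)"""
    i = bisect_right(checkpoints, overlap_depth) - 1
    if i < 0:
        # No usable checkpoint: recompute the whole shared prefix.
        return None, overlap_depth
    return checkpoints[i], overlap_depth - checkpoints[i]
```

For example, with states saved at depths 0, 512, and 1024, a request overlapping the cached prefix for 800 tokens resumes from the checkpoint at 512 and recomputes 288 tokens.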
Key facts
- arXiv paper 2605.05219 introduces sparse prefix caching for hybrid and recurrent LLM serving.
- State-space models can resume from a single stored recurrent state, unlike dense per-token caching.
- The method stores exact recurrent states at sparse checkpoint positions.
- On a cache hit, the system resumes from the deepest stored checkpoint and recomputes the remaining suffix.
- The approach is formalized as a checkpoint placement problem with an O(NM) dynamic program.
- It improves the Pareto frontier over standard heuristics on QuALITY and Sys datasets.
- The technique benefits scenarios where requests share a non-trivial prefix.
- The paper is a cross-listed submission.
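The checkpoint-placement objective behind these facts can be sketched as a dynamic program: given a distribution over overlap depths, choose up to M checkpoint positions among N token positions to minimize the expected number of suffix tokens recomputed on a hit. The paper's exact O(NM) algorithm is not reproduced here; the version below is a simpler O(N²M) reference formulation of the same kind of recurrence, with names and conventions (e.g. a free implicit checkpoint at depth 0) chosen for illustration.

```python
def place_checkpoints(p, M):
    """Minimum expected suffix-recompute length when up to M recurrent-state
    checkpoints may be placed at token positions 1..N, where p[d] is the
    probability that a request's prefix overlap has depth d (d in 0..N).
    A free implicit checkpoint at depth 0 (the empty state) always exists.
    Hypothetical O(N^2 * M) sketch; the paper reports an exact O(NM) DP."""
    N = len(p) - 1
    # Prefix sums so the expected cost of any checkpoint interval is O(1):
    # P[j] = sum of p[d] for d < j, Q[j] = sum of d * p[d] for d < j.
    P = [0.0] * (N + 2)
    Q = [0.0] * (N + 2)
    for d in range(N + 1):
        P[d + 1] = P[d] + p[d]
        Q[d + 1] = Q[d] + d * p[d]

    def cost(a, b):
        # Expected recompute tokens for overlap depths in [a, b) when the
        # deepest checkpoint at or below those depths sits at position a.
        return (Q[b] - Q[a]) - a * (P[b] - P[a])

    INF = float("inf")
    # g[j]: min cost over depths [0, j) with the last-placed checkpoint at j.
    g = [INF] * (N + 1)
    g[0] = 0.0                      # implicit checkpoint at depth 0
    best = cost(0, N + 1)           # baseline: no extra checkpoints
    for _ in range(M):
        h = [INF] * (N + 1)
        for j in range(1, N + 1):
            h[j] = min((g[i] + cost(i, j) for i in range(j)), default=INF)
        g = h
        best = min(best, min(g[j] + cost(j, N + 1) for j in range(1, N + 1)))
    return best
```

For instance, with overlap depth uniform over {0, 1, 2, 3}, the expected recompute is 1.5 tokens with no checkpoints, and placing a single checkpoint at depth 2 reduces it to 0.5.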
Entities
Institutions
- arXiv