KVServe: Adaptive KV Cache Compression for LLM Serving
KVServe is a service-aware, adaptive KV communication compression framework for disaggregated LLM serving, described in a paper on arXiv (2605.13734). It addresses the cost of transferring KV caches across network and storage boundaries in production environments. Unlike static compression techniques, KVServe adapts to changes in workload mix, available bandwidth, and SLO/quality budgets. It unifies KV compression methods into a modular strategy space with new components and cross-method recomposition, and uses a Bayesian Profiling Engine to explore that space efficiently and distill a 3D Pareto candidate set. The goal is to improve efficiency and reduce latency in disaggregated LLM serving.
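The paper describes distilling a 3D Pareto candidate set from a strategy space. As a hedged illustration only (the objective names and code below are assumptions, not taken from the paper), a minimal sketch of extracting the non-dominated set from profiled compression strategies, where each strategy is scored on three objectives to be minimized, e.g. quality loss, transfer latency, and compute overhead:

```python
def is_dominated(p, q):
    """True if candidate q dominates p: q is no worse on every
    objective and strictly better on at least one (all minimized)."""
    return all(qi <= pi for qi, pi in zip(q, p)) and \
           any(qi < pi for qi, pi in zip(q, p))

def pareto_set(points):
    """Keep only non-dominated points: the 3D Pareto candidate set
    that a profiling engine would hand to the runtime selector."""
    return [p for p in points
            if not any(is_dominated(p, q) for q in points if q != p)]

# Hypothetical profiled strategies: (quality_loss, latency_s, overhead).
profiled = [
    (0.00, 2.0, 1.0),   # no compression: perfect quality, slow transfer
    (0.01, 1.0, 0.8),   # mild compression
    (0.02, 1.5, 0.9),   # dominated by the strategy above
    (0.05, 0.4, 0.3),   # aggressive compression
]
candidates = pareto_set(profiled)
```

In a real profiling engine the `profiled` list would be populated by Bayesian-guided measurements rather than enumerated by hand; the Pareto filter itself is the same either way.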
Key facts
- KVServe is a service-aware and adaptive KV communication compression framework.
- It targets disaggregated LLM serving with prefill/decode (PD) separation and KV state disaggregation.
- Existing KV compression methods are static and suboptimal under varying service contexts.
- KVServe unifies KV compression into a modular strategy space.
- It introduces a Bayesian Profiling Engine for efficient search.
- The engine distills a 3D Pareto candidate set.
- The paper is available on arXiv with ID 2605.13734.
- The framework adapts to workload mix, bandwidth, and SLO/quality budgets.
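The adaptation described above amounts to picking, at serving time, the Pareto candidate that fits the current bandwidth and SLO budget. As a sketch under stated assumptions (the candidate names, compression ratios, and selection rule below are illustrative, not the paper's actual policy), one simple selector minimizes quality loss subject to the KV transfer fitting the latency budget:

```python
def select_strategy(pareto, kv_bytes, bandwidth_bps, latency_slo_s):
    """Pick from a Pareto candidate set the strategy with the least
    quality loss whose estimated KV transfer time fits the SLO budget.
    Each candidate is (name, size_ratio, quality_loss), where size_ratio
    is the fraction of the original KV cache that must be transferred."""
    feasible = [
        (name, ratio, qloss)
        for name, ratio, qloss in pareto
        if kv_bytes * ratio / bandwidth_bps <= latency_slo_s
    ]
    if not feasible:
        # Nothing fits the budget: fall back to the smallest transfer.
        return min(pareto, key=lambda s: s[1])
    return min(feasible, key=lambda s: s[2])

# Hypothetical candidates: (name, size_ratio, quality_loss).
candidates = [
    ("fp16-baseline", 1.00, 0.00),
    ("int8",          0.50, 0.01),
    ("int4",          0.25, 0.03),
    ("topk-sparse",   0.10, 0.08),
]

# 2 GB of KV state over a 1 GB/s link with a 0.5 s transfer budget.
choice = select_strategy(candidates, 2e9, 1e9, 0.5)
```

Under these numbers, "int4" is the least-lossy candidate whose transfer (0.5 s) still meets the budget; tightening the SLO would push the selector toward more aggressive compression.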