SplitZip: GPU-Friendly Lossless KV Compression for LLM Serving
SplitZip is a GPU-friendly lossless compressor designed to accelerate KV-cache transfer in disaggregated LLM serving systems. Disaggregated serving runs the prefill and decode phases on separate workers, so the KV cache produced during prefill must be transferred to the decode worker. This transfer becomes a bottleneck, especially for long-input and agentic workloads. Existing lossless codecs are a poor fit for this online path: they target offline weight compression, rely on the CPU, or use variable-length coding that decompresses quickly but compresses too slowly for online use. SplitZip exploits redundancy in the floating-point exponents of KV activations and encodes them efficiently on the GPU, so that both compression and decompression are fast enough for online serving.
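The exponent-redundancy idea can be illustrated with a small CPU-side sketch. This is not SplitZip's actual encoder or its GPU kernels, which the summary does not describe. It assumes BF16 values (1 sign bit, 8 exponent bits, 7 mantissa bits), uses Gaussian samples as a stand-in for real KV activations, and all function names are hypothetical. It splits each value into an exponent byte and a sign-plus-mantissa byte, measures the empirical entropy of each stream, and checks that the split is exactly reversible.

```python
import math
import random
import struct
from collections import Counter


def float_to_bf16_bits(x):
    """Truncate a Python float to bfloat16 and return its 16-bit pattern."""
    (bits32,) = struct.unpack("<I", struct.pack("<f", x))
    return bits32 >> 16  # top 16 bits: sign (1), exponent (8), mantissa (7)


def split_bf16(bits):
    """Split a bf16 pattern into (exponent byte, sign+mantissa byte)."""
    exponent = (bits >> 7) & 0xFF
    sign_mantissa = ((bits >> 15) << 7) | (bits & 0x7F)
    return exponent, sign_mantissa


def merge_bf16(exponent, sign_mantissa):
    """Inverse of split_bf16: rebuild the original 16-bit pattern."""
    sign = sign_mantissa >> 7
    mantissa = sign_mantissa & 0x7F
    return (sign << 15) | (exponent << 7) | mantissa


def entropy_bits(symbols):
    """Empirical Shannon entropy in bits per symbol."""
    counts = Counter(symbols)
    total = sum(counts.values())
    return -sum(c / total * math.log2(c / total) for c in counts.values())


# Assumption: Gaussian samples stand in for real KV activations.
rng = random.Random(0)
values = [rng.gauss(0.0, 1.0) for _ in range(20000)]
bf16 = [float_to_bf16_bits(v) for v in values]

exponents, rest = zip(*(split_bf16(b) for b in bf16))
exp_entropy = entropy_bits(exponents)
rest_entropy = entropy_bits(rest)

# Lossless check: merging the two streams must reproduce every bit pattern.
roundtrip_ok = all(
    merge_bf16(e, r) == b for e, r, b in zip(exponents, rest, bf16)
)

print(f"exponent entropy:      {exp_entropy:.2f} bits of 8")
print(f"sign+mantissa entropy: {rest_entropy:.2f} bits of 8")
print(f"round trip lossless:   {roundtrip_ok}")
```

On this synthetic input the exponent byte carries far fewer than 8 bits of information, while the sign-and-mantissa byte is close to incompressible. That asymmetry is why a codec can focus on the exponent stream. Real KV activations may be distributed differently.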
Key facts
- SplitZip is a lossless compressor for KV-cache transfer in disaggregated LLM serving.
- It targets the bottleneck of transferring KV cache from prefill to decode workers.
- Existing lossless codecs are unsuitable for online use: they target offline weight compression, rely on the CPU, or compress too slowly.
- SplitZip exploits redundancy in floating-point exponents of KV activations.
- It is GPU-friendly and designed for online use.
- The work is published on arXiv with ID 2605.01708.
- It addresses long-input and agentic workloads.
- The compressor is lossless.
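To see why KV-cache transfer grows into a bottleneck for long inputs, a back-of-the-envelope estimate helps. The sketch below uses the standard KV-cache size formula (keys and values for every layer, KV head, head dimension, and token). The model shape, link bandwidth, and compression ratio are hypothetical placeholders chosen for illustration; none of them are figures from the SplitZip work, and the estimate ignores compression and decompression time.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, bytes_per_value=2):
    """Size of the KV cache for one sequence, in bytes."""
    # Factor 2: each layer stores one key tensor and one value tensor.
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * bytes_per_value


def transfer_seconds(num_bytes, link_gbytes_per_s, compression_ratio=1.0):
    """Time to move num_bytes over a link, ignoring codec time and latency."""
    return num_bytes / compression_ratio / (link_gbytes_per_s * 1e9)


# Hypothetical model shape and link speed, for illustration only.
size = kv_cache_bytes(num_layers=32, num_kv_heads=8, head_dim=128, seq_len=131072)
raw_time = transfer_seconds(size, link_gbytes_per_s=25.0)
# Placeholder ratio, not a measured SplitZip result.
compressed_time = transfer_seconds(size, link_gbytes_per_s=25.0, compression_ratio=1.3)

print(f"KV cache size:        {size / 2**30:.1f} GiB")
print(f"uncompressed transfer: {raw_time:.3f} s")
print(f"compressed transfer:   {compressed_time:.3f} s")
```

Because the size scales linearly with sequence length, the transfer cost scales with it too, which is consistent with long-input and agentic workloads being the cases where the transfer dominates.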
Entities
Institutions
- arXiv