GhostServe: Fault-Tolerant LLM Serving via Shadow Checkpointing
GhostServe is a checkpointing system that protects the key-value (KV) cache in large language model (LLM) inference services from hardware and software faults. As million-token, agent-based applications become more prevalent, their long-running tasks become more vulnerable to failures, which cause costly job interruptions and wasted resources. GhostServe addresses this by applying erasure coding to the KV cache and storing the resulting parity shards in host memory, so a lost KV cache can be reconstructed quickly without full recomputation. The system operates "in the shadow": it runs transparently alongside the main inference process. Inference can therefore resume seamlessly after a device failure, which significantly improves fault tolerance for distributed LLM serving. The work is presented in arXiv preprint 2605.00831.
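The parity-based recovery described above can be illustrated with a minimal sketch. It assumes the simplest possible erasure code, a single XOR parity shard over equal-sized KV cache shards, which tolerates the loss of exactly one shard. The summary does not say which code GhostServe actually uses, and the names `make_parity` and `reconstruct` are hypothetical, chosen only for illustration; plain `bytes` objects stand in for per-device KV cache buffers.

```python
from functools import reduce
from typing import List, Optional


def xor_bytes(a: bytes, b: bytes) -> bytes:
    """Byte-wise XOR of two equal-length buffers."""
    if len(a) != len(b):
        raise ValueError("shards must have equal length")
    return bytes(x ^ y for x, y in zip(a, b))


def make_parity(shards: List[bytes]) -> bytes:
    """Compute one parity shard as the XOR of all data shards.

    In this sketch the parity shard is what would be kept in host memory.
    """
    return reduce(xor_bytes, shards)


def reconstruct(shards: List[Optional[bytes]], parity: bytes) -> bytes:
    """Rebuild the single missing shard (marked None) from survivors and parity."""
    missing = [i for i, s in enumerate(shards) if s is None]
    if len(missing) != 1:
        raise ValueError("single-parity XOR recovers exactly one lost shard")
    survivors = [s for s in shards if s is not None]
    # XOR-ing the parity with every surviving shard cancels them out,
    # leaving only the contents of the lost shard.
    return reduce(xor_bytes, survivors, parity)


# Toy stand-ins for the KV cache held on three devices (assumed layout).
kv_shards = [bytes([1, 2, 3, 4]), bytes([5, 6, 7, 8]), bytes([9, 10, 11, 12])]
parity = make_parity(kv_shards)

# Simulate losing the second device, then recover its shard from parity.
damaged = [kv_shards[0], None, kv_shards[2]]
recovered = reconstruct(damaged, parity)
```

Here `recovered` equals the lost shard, so the cache contents are restored by XOR operations over the surviving data rather than by recomputing them from the prompt. Tolerating more than one simultaneous loss would require a stronger code, such as Reed-Solomon, with additional parity shards.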
Key facts
- GhostServe uses erasure coding to protect the KV cache.
- Parity shards are stored in host memory.
- Enables fast reconstruction of lost KV cache after device failures.
- Avoids the need for costly full recomputation of the KV cache.
- Designed for million-token, agent-based LLM applications.
- Operates "in the shadow", transparently alongside the main inference process.
- Addresses hardware and software faults in distributed serving.
- Published as arXiv:2605.00831.
Entities
Institutions
- arXiv