ObjectCache: KV Cache in S3-Compatible Object Storage for LLMs
ObjectCache, described in arXiv:2605.22850, stores large language model (LLM) KV caches in S3-compatible object storage instead of costly remote DRAM pools. The goal is to reduce the size and cost of serving clusters while keeping the impact on time to first token (TTFT) minimal. To do this, it co-designs the storage protocol and the transfer schedule so that KV cache data arrives in the order the GPU consumes it, which lets data transfer overlap with computation across concurrent requests. A prototype was built on a 100 Gbps RoCE cluster using NIXL, an inference transfer library that abstracts access to memory and storage. The paper presents this as a viable alternative to existing prefix KV caching techniques, which fall back on remote DRAM because GPU memory and local DRAM are limited in capacity.
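The idea of delivering KV data in consumption order can be illustrated with a small sketch. This is not the paper's protocol or the NIXL API: the in-memory dict standing in for an S3-compatible store, the per-layer key layout, and the names `layer_key`, `prefetch_layers`, and `run_prefill` are all assumptions made for illustration. A background thread fetches one object per layer, lowest layer first, while the consumer processes layers as they arrive.

```python
import queue
import threading


def layer_key(prefix_hash: str, layer: int) -> str:
    # Hypothetical key layout: one object per (prefix, layer) pair.
    return f"kv/{prefix_hash}/layer-{layer:04d}"


def prefetch_layers(store, prefix_hash, num_layers, out_q):
    # Fetch objects in the order the consumer needs them (layer 0 first).
    for layer in range(num_layers):
        out_q.put((layer, store[layer_key(prefix_hash, layer)]))
    out_q.put(None)  # sentinel: no more layers


def run_prefill(store, prefix_hash, num_layers, compute_layer, depth=2):
    # Bounded queue: at most `depth` layers are staged ahead of compute.
    out_q = queue.Queue(maxsize=depth)
    fetcher = threading.Thread(
        target=prefetch_layers,
        args=(store, prefix_hash, num_layers, out_q),
        daemon=True,
    )
    fetcher.start()
    results = []
    while (item := out_q.get()) is not None:
        layer, kv_bytes = item
        # Compute on layer i runs while the fetcher pulls later layers.
        results.append(compute_layer(layer, kv_bytes))
    fetcher.join()
    return results


# Stand-in for an object store: three layers of dummy KV bytes.
demo_store = {layer_key("abc", i): bytes([i]) * 4 for i in range(3)}
demo_out = run_prefill(demo_store, "abc", 3, lambda layer, kv: (layer, len(kv)))
```

The bounded queue caps how far the fetcher runs ahead, which is one simple way to limit staging memory. A real system would replace the dict lookup with object-storage reads and the callback with GPU work.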
Key facts
- ObjectCache stores KV cache in S3-compatible object storage
- Aims to reduce serving-cluster size and cost
- Aims to keep the impact on time to first token (TTFT) minimal
- Co-designs storage protocol and transfer schedule
- Delivers KV cache data in GPU consumption order
- Overlaps data transfer with compute across concurrent requests
- Prototype built on 100 Gbps RoCE cluster with NIXL
- Paper published on arXiv with ID 2605.22850
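A back-of-envelope model shows why overlapping transfer with compute matters for TTFT on a 100 Gbps link. Only the link speed comes from the summary above. The model shape (32 layers, 8 KV heads, head dimension 128, fp16), the 8192-token prefix, and the 4 ms per-layer compute time are assumed values, and the transfer time is a line-rate lower bound that ignores protocol overhead and storage latency.

```python
def kv_bytes_per_layer(tokens, kv_heads, head_dim, dtype_bytes=2):
    # Keys and values: two tensors of shape [tokens, kv_heads, head_dim].
    return 2 * tokens * kv_heads * head_dim * dtype_bytes


def transfer_seconds(num_bytes, link_gbps=100.0):
    # Line-rate lower bound: bits to move divided by link bits per second.
    return num_bytes * 8 / (link_gbps * 1e9)


def sequential_time(fetch, compute):
    # Fetch everything first, then compute everything.
    return sum(fetch) + sum(compute)


def overlapped_time(fetch, compute):
    # Fetches are issued back to back in layer order; layer i's compute
    # starts once its KV has arrived and layer i-1's compute has finished.
    fetch_done = 0.0
    compute_done = 0.0
    for f, c in zip(fetch, compute):
        fetch_done += f
        compute_done = max(compute_done, fetch_done) + c
    return compute_done


layers = 32  # assumed model depth
per_layer = kv_bytes_per_layer(tokens=8192, kv_heads=8, head_dim=128)
fetch = [transfer_seconds(per_layer)] * layers
compute = [0.004] * layers  # assumed per-layer compute time in seconds

print(f"per-layer KV: {per_layer / 2**20:.0f} MiB")
print(f"sequential:   {sequential_time(fetch, compute) * 1e3:.1f} ms")
print(f"overlapped:   {overlapped_time(fetch, compute) * 1e3:.1f} ms")
```

Under these assumptions each layer's fetch is shorter than its compute, so in the overlapped schedule only the first layer's fetch is exposed and the rest is hidden behind computation. If the fetch were slower than the compute, the same function would show the schedule becoming transfer-bound instead.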
Entities
Institutions
- arXiv