CacheTTL: KV Cache Time-to-Live for Multi-Turn LLM Agent Scheduling
CacheTTL is a serving system that improves job completion for multi-turn LLM agent workloads by retaining the KV cache across tool invocations. Existing inference engines evict the KV cache of finished requests as soon as new requests are waiting, but agentic workloads interleave LLM calls with tool calls, and the resulting pauses defeat KV reuse. CacheTTL weighs the cost of recomputing the cache, or reloading it when offloading is enabled, against the increased queueing delay a request incurs after its cache is evicted from the GPU. The approach remains effective even when tool call durations vary and are hard to predict. Details are in arXiv paper 2511.02230.
Key facts
- CacheTTL is a serving system for multi-turn LLM agent workloads.
- It retains KV cache during tool calls to improve efficiency.
- Existing inference engines evict KV cache of finished requests when new requests wait.
- Agentic workloads interleave LLM calls with tools, causing pauses.
- Tool calls are often shorter than human multi-turn chatbot responses.
- CacheTTL considers recomputation/reloading costs and queueing delays.
- The method is robust to variance in tool call durations.
- The paper is available on arXiv with ID 2511.02230.
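The trade-off described above can be sketched as a simple cost comparison. This is an illustrative model only, not the paper's actual algorithm: all function names, parameters, and the cost formula are assumptions.

```python
# Illustrative sketch (assumed, not from the paper): decide whether to keep a
# paused request's KV cache on the GPU during a tool call, by comparing the
# cost of eviction against the cost of holding GPU memory.

def should_retain_kv(expected_tool_seconds: float,
                     recompute_seconds: float,
                     reload_seconds: float,
                     requeue_delay_seconds: float,
                     holding_penalty_per_second: float,
                     offloading_enabled: bool = False) -> bool:
    """Return True if retaining the KV cache is estimated to be cheaper.

    Eviction cost: recomputing the prefix (or reloading it from host memory
    when offloading is enabled) plus the queueing delay the request suffers
    when it returns from the tool call and must wait for GPU admission.
    Retention cost: the penalty of occupying GPU memory, and thus delaying
    waiting requests, for the tool call's expected duration.
    """
    restore_cost = reload_seconds if offloading_enabled else recompute_seconds
    eviction_cost = restore_cost + requeue_delay_seconds
    retention_cost = expected_tool_seconds * holding_penalty_per_second
    return eviction_cost > retention_cost

# Short tool calls tend to favor retention; long ones favor eviction.
keep_short = should_retain_kv(0.5, 2.0, 0.8, 1.0, 1.0)
keep_long = should_retain_kv(60.0, 2.0, 0.8, 1.0, 1.0)
print(keep_short, keep_long)
```

Because tool call durations fluctuate, a real scheduler would treat `expected_tool_seconds` as an estimate; the paper's robustness claim suggests the decision need not depend on precise predictions.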