Irminsul: Position-Independent Caching for Agentic LLM Serving
Irminsul is a newly developed caching system that tackles cache-hit regressions in agentic LLM workloads, where bit-identical tokens land at different positions each turn, rendering prefix caches ineffective. Operators report significant slowdowns, with time-to-first-token (TTFT) spikes of 10-16 seconds even when content is unchanged. Previous position-independent caching systems had to correct Rotary Position Embedding (RoPE) across the entire key dimension, an architectural cost imposed by Grouped Query Attention (GQA). In contrast, Multi-Head Latent Attention (MLA), used in models such as DeepSeek-V2/V3/R1 and Kimi-K2/Moonlight, factors each KV row into a position-free c_KV and a 64-dimensional k_r that can be corrected in closed form. Irminsul extends SGLang's radix cache with content-hash keying over Content-Defined Chunking (CDC) segments and a delta-rotation rule for k_r. The system is evaluated on three native MLA-MoE implementations, including DeepSeek-V2-Lite (16B total, 2.4B active) and Kimi.
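As a rough sketch of what the closed-form correction looks like: because RoPE rotates the 64-dimensional k_r by an angle proportional to position, a cached k_r can be moved to a new position by rotating through the position delta alone, leaving the position-free c_KV untouched. The snippet below assumes the interleaved-pair RoPE formulation with base 10000; the function names and parameters are illustrative, not Irminsul's API.

```python
import numpy as np

def rope_angles(dim: int, base: float = 10000.0) -> np.ndarray:
    # One frequency per even/odd pair of the rotary sub-vector (dim must be even).
    return base ** (-np.arange(0, dim, 2, dtype=np.float64) / dim)

def reposition_k_r(k_r: np.ndarray, old_pos: int, new_pos: int,
                   base: float = 10000.0) -> np.ndarray:
    # RoPE rotations compose additively: R(new) = R(new - old) @ R(old),
    # so a k_r cached at old_pos can be corrected to new_pos by rotating
    # through the position delta alone.
    # (Interleaved pairing is assumed here; some implementations use the
    # half-split "rotate_half" convention instead.)
    delta = new_pos - old_pos
    theta = rope_angles(k_r.shape[-1], base) * delta
    cos, sin = np.cos(theta), np.sin(theta)
    x, y = k_r[..., 0::2], k_r[..., 1::2]
    out = np.empty_like(k_r)
    out[..., 0::2] = x * cos - y * sin
    out[..., 1::2] = x * sin + y * cos
    return out

# Example: a 64-dim k_r cached at position 120 is reused at position 37.
cached_k_r = np.random.randn(64)
corrected = reposition_k_r(cached_k_r, old_pos=120, new_pos=37)
```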
Key facts
- Irminsul is a position-independent caching system for agentic LLM serving.
- Agentic workloads cause bit-identical tokens at shifted positions, voiding prefix caches.
- Operators report TTFT spikes of 10-16 seconds on unchanged content.
- Prior systems correct RoPE across the full key dimension, an architectural cost imposed by GQA.
- MLA factors KV rows into position-free c_KV and 64-dim k_r correctable in closed form.
- MLA is deployed in DeepSeek-V2/V3/R1, Kimi-K2/Moonlight, GLM-5, and Mistral Large 3.
- Irminsul extends SGLang's radix cache with content-hash keying over CDC segments (see the sketch after this list).
- Irminsul uses a delta-rotation rule for k_r.
- Evaluated on three native MLA-MoE models, including DeepSeek-V2-Lite (16B/2.4B) and Kimi.
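As a rough illustration of content-hash keying over CDC segments (referenced above), the sketch below splits a token-ID sequence at rolling-hash-defined boundaries and hashes each segment into a position-independent key. The chunking parameters, the toy rolling hash, and all names here are illustrative assumptions, not Irminsul's or SGLang's actual implementation.

```python
import hashlib
from typing import Iterator, List

def cdc_segments(token_ids: List[int], mask_bits: int = 6,
                 min_len: int = 16, max_len: int = 256) -> Iterator[List[int]]:
    # Declare a boundary when the low `mask_bits` bits of a rolling hash are zero,
    # so boundaries depend only on nearby token content, not on absolute position.
    h, start = 0, 0
    for i, tok in enumerate(token_ids):
        h = ((h << 1) ^ (tok * 2654435761)) & 0xFFFFFFFF  # toy rolling hash
        seg_len = i - start + 1
        if (seg_len >= min_len and (h & ((1 << mask_bits) - 1)) == 0) or seg_len >= max_len:
            yield token_ids[start:i + 1]
            start, h = i + 1, 0
    if start < len(token_ids):
        yield token_ids[start:]

def segment_key(segment: List[int]) -> str:
    # Content hash used as the cache key: identical segments map to the same key
    # regardless of where they appear in the prompt.
    return hashlib.sha256(b"|".join(str(t).encode() for t in segment)).hexdigest()

# Example: derive position-independent keys for every segment of a prompt.
keys = [segment_key(seg) for seg in cdc_segments(list(range(1000)))]
```

Keying the radix cache by these segment hashes rather than by absolute prefix position is what lets identical content hit the cache even after it shifts across turns.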
Entities
Institutions
- DeepSeek
- Kimi
- GLM
- Mistral
- SGLang