ARTFEED — Contemporary Art Intelligence

Irminsul: Position-Independent Caching for Agentic LLM Serving

ai-technology · 2026-05-09

Irminsul is a newly developed caching system that tackles cache-hit regressions in agentic LLM workloads, where bit-identical tokens land at different positions each turn and prefix caches therefore miss. Operators report significant slowdowns, with time-to-first-token (TTFT) spikes of 10-16 seconds even when content is unchanged. Prior position-independent caching systems corrected Rotary Position Embedding (RoPE) across the entire key dimension, an architectural cost imposed by Grouped Query Attention (GQA). Multi-Head Latent Attention (MLA), used in models such as DeepSeek-V2/V3/R1 and Kimi-K2/Moonlight, instead factors each KV row into a position-free c_KV and a 64-dimensional rotary component k_r that can be corrected in closed form. Irminsul extends SGLang's radix cache with content-hash keying over Content-Defined Chunking (CDC) segments and a delta-rotation rule for k_r. The system is evaluated on three native MLA-MoE implementations, including DeepSeek-V2-Lite (16B total / 2.4B active) and Kimi.
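The item does not spell out the correction formula, but because RoPE rotations compose additively, a rotary key cached at one position can be mapped to a new position by rotating it through the position delta. The NumPy sketch below is a minimal illustration of such a delta rotation over a 64-dimensional k_r, assuming standard interleaved RoPE with base 10000; the function names and positions are illustrative stand-ins, not Irminsul's API.

    import numpy as np

    def rope_angles(dim: int, base: float = 10000.0) -> np.ndarray:
        # Per-pair frequencies theta_i = base^(-2i/dim), as in standard RoPE.
        return base ** (-np.arange(0, dim, 2) / dim)

    def rotate(x: np.ndarray, pos: int, theta: np.ndarray) -> np.ndarray:
        # Apply the RoPE rotation for absolute position `pos` to a (dim,)
        # vector, pairing adjacent components (x[0], x[1]), (x[2], x[3]), ...
        x1, x2 = x[0::2], x[1::2]
        cos, sin = np.cos(pos * theta), np.sin(pos * theta)
        out = np.empty_like(x)
        out[0::2] = x1 * cos - x2 * sin
        out[1::2] = x1 * sin + x2 * cos
        return out

    def delta_rotate(k_r_cached: np.ndarray, old_pos: int, new_pos: int,
                     theta: np.ndarray) -> np.ndarray:
        # Closed-form correction: rotating by (new_pos - old_pos) maps a key
        # encoded at old_pos onto the key that would have been encoded at
        # new_pos, without recomputing anything from the original activations.
        return rotate(k_r_cached, new_pos - old_pos, theta)

    # Sanity check: a 64-dim k_r cached at position 100 is reused at position 137.
    dim = 64
    theta = rope_angles(dim)
    k = np.random.default_rng(0).standard_normal(dim)
    cached = rotate(k, 100, theta)                 # what an earlier turn stored
    corrected = delta_rotate(cached, 100, 137, theta)
    assert np.allclose(corrected, rotate(k, 137, theta))

Because c_KV carries no positional encoding at all, only this 64-dimensional component needs any adjustment when a cached block is reused at a new offset.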

Key facts

  • Irminsul is a position-independent caching system for agentic LLM serving.
  • Agentic workloads cause bit-identical tokens at shifted positions, voiding prefix caches.
  • Operators report TTFT spikes of 10-16 seconds on unchanged content.
  • Prior systems correct RoPE over the full key dimension, an architectural cost imposed by GQA.
  • MLA factors KV rows into position-free c_KV and 64-dim k_r correctable in closed form.
  • MLA is deployed in DeepSeek-V2/V3/R1, Kimi-K2/Moonlight, GLM-5, and Mistral Large 3.
  • Irminsul extends SGLang's radix cache with content-hash keying over CDC segments (see the sketch after this list).
  • Irminsul uses a delta-rotation rule for k_r.
  • Evaluated on DeepSeek-V2-Lite (16B total / 2.4B active), Kimi.
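Neither the chunking parameters nor the hash function are described in the item; the sketch below only illustrates the general mechanism of content-hash keying over CDC segments, assuming a simple windowed-hash boundary rule. cdc_segments, content_key, and the window, mask, and length thresholds are hypothetical stand-ins rather than Irminsul's implementation.

    import hashlib
    import random

    def cdc_segments(tokens: list[int], window: int = 16, mask: int = 0x3F,
                     min_len: int = 32, max_len: int = 512) -> list[list[int]]:
        # Content-defined chunking: declare a boundary whenever a hash of the
        # trailing `window` tokens satisfies the mask, subject to min/max
        # lengths, so cut points depend on local content, not absolute position.
        segments, start = [], 0
        for i in range(len(tokens)):
            length = i - start + 1
            if length < min_len:
                continue
            h = hash(tuple(tokens[max(start, i - window + 1):i + 1]))
            if (h & mask) == 0 or length >= max_len:
                segments.append(tokens[start:i + 1])
                start = i + 1
        if start < len(tokens):
            segments.append(tokens[start:])
        return segments

    def content_key(segment: list[int]) -> str:
        # Position-independent cache key: a hash of the segment's content only.
        return hashlib.sha256(repr(segment).encode()).hexdigest()

    # The same history shifted by a few prepended tokens (e.g. a new system
    # turn) should yield largely identical segment keys once the chunker
    # re-synchronizes, so the corresponding KV blocks can be reused.
    rng = random.Random(0)
    history = [rng.randrange(32_000) for _ in range(2000)]
    shifted = [7, 7, 7] + history
    keys_a = {content_key(s) for s in cdc_segments(history)}
    keys_b = {content_key(s) for s in cdc_segments(shifted)}
    print(f"segments reused: {len(keys_a & keys_b)} / {len(keys_a)}")

Because the cut points depend only on tokens near the boundary, a prepended turn shifts every absolute position but leaves most segment contents, and therefore most cache keys, unchanged; that is what lets a content-keyed radix cache hit on shifted content.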

Entities

Institutions

  • DeepSeek
  • Kimi
  • GLM
  • Mistral
  • SGLang

Sources