ARTFEED — Contemporary Art Intelligence

QKVShare: Quantized KV-Cache Handoff for Multi-Agent On-Device LLMs

ai-technology · 2026-05-07

QKVShare is a newly introduced framework for quantized KV-cache handoff between agents in multi-agent LLM systems running on edge devices. It combines token-level mixed-precision allocation, a self-contained CacheCard representation, and a HuggingFace-compatible cache-injection method. Evaluated on 150 GSM8K problems with Llama-3.1-8B-Instruct, adaptive quantization holds up under repeated handoffs and most clearly outperforms uniform quantization in deeper-hop, higher-budget settings. On handoff latency, QKVShare cuts TTFT relative to full re-prefill at every tested context length: 130.7 ms versus 150.2 ms at a nominal 1K context, and 397.1 ms versus 1029.7 ms at a nominal 8K context. Timing breakdowns show that latency is dominated by post-injection generation rather than cache transfer.
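The article does not spell out QKVShare's allocation policy, but token-level mixed-precision quantization of a KV cache can be sketched as follows. This is a minimal illustration under assumed conventions (symmetric per-token scales, 8-bit storage for both 4- and 8-bit codes); it is not the paper's actual implementation.

```python
# Sketch of token-level mixed-precision KV-cache quantization.
# Assumption (not from the article): each token gets a bit width from a
# per-token allocation, with symmetric per-token scaling.
import numpy as np

def quantize_token(vec: np.ndarray, bits: int):
    """Symmetric per-token quantization to signed `bits`-bit integers."""
    qmax = 2 ** (bits - 1) - 1              # 127 for 8-bit, 7 for 4-bit
    scale = float(np.abs(vec).max()) / qmax or 1.0
    q = np.clip(np.round(vec / scale), -qmax, qmax).astype(np.int8)
    return q, scale

def dequantize_token(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale

def quantize_kv(kv: np.ndarray, bit_alloc: list):
    """Quantize a [tokens, head_dim] KV slice with per-token bit widths."""
    return [quantize_token(kv[t], bits) for t, bits in enumerate(bit_alloc)]
```

With this scheme, the reconstruction error for a token is bounded by half its scale step, so spending 8 bits on "important" tokens and 4 bits elsewhere trades fidelity for cache size per token.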

Key facts

  • QKVShare is a framework for quantized KV-cache handoff between agents.
  • It uses token-level mixed-precision allocation, CacheCard representation, and HuggingFace-compatible cache injection.
  • Tested on 150 GSM8K problems with Llama-3.1-8B-Instruct.
  • Adaptive quantization shows its clearest gains over uniform quantization in deeper-hop, higher-budget settings.
  • QKVShare reduces TTFT relative to full re-prefill at all tested contexts.
  • At 1K context: 130.7 ms vs. 150.2 ms.
  • At 8K context: 397.1 ms vs. 1029.7 ms.
  • Post-injection generation dominates latency, not cache transfer.
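The "self-contained CacheCard" presumably bundles the quantized tensors with all metadata needed to reinject them on the receiving agent. The article does not specify the format, so the fields below (model ID, per-token bit allocation, per-layer quantized tensors and scales) are hypothetical:

```python
# Hypothetical sketch of a CacheCard-style handoff record. The article says
# the representation is self-contained but gives no schema; every field
# here is an assumption for illustration.
import io
import pickle
from dataclasses import dataclass, field

@dataclass
class CacheCard:
    model_id: str        # e.g. "meta-llama/Llama-3.1-8B-Instruct"
    num_tokens: int
    bit_alloc: list      # per-token bit widths (mixed precision)
    layers: list = field(default_factory=list)  # per-layer (q_keys, q_values, scales)

    def serialize(self) -> bytes:
        buf = io.BytesIO()
        pickle.dump(self, buf)
        return buf.getvalue()

    @staticmethod
    def deserialize(blob: bytes) -> "CacheCard":
        return pickle.loads(blob)
```

On receipt, an agent would dequantize each layer and inject the result as the model's past key/value cache before generating (the HuggingFace-compatible injection path the article mentions), skipping re-prefill of the shared context; that injection step is where the reported TTFT savings come from.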

Entities

Institutions

  • HuggingFace

Sources