ARTFEED — Contemporary Art Intelligence

SparKV: Adaptive KV Cache Loading for On-Device LLM Inference

ai-technology · 2026-04-25

SparKV is a framework that combines cloud-based Key-Value (KV) cache streaming with on-device computation to speed up Large Language Model (LLM) inference on end devices. It models the cost of each KV chunk to decide whether to stream it from the cloud or recompute it locally, and overlaps the two execution paths to reduce latency. SparKV also refines offline-generated schedules at runtime to adapt to fluctuations in wireless connectivity and edge resource availability. Experiments show a 1.3x-5.1x reduction in Time-to-First-Token with negligible impact on response quality, alongside a 1.5x-3.3x reduction in per-request energy consumption.
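The per-chunk stream-vs-compute decision can be sketched as a simple latency comparison. This is a minimal illustration, not SparKV's actual cost model; all names (`KVChunk`, `plan`, the bandwidth and FLOPs parameters) are assumptions for exposition.

```python
from dataclasses import dataclass

@dataclass
class KVChunk:
    size_bytes: int       # serialized size of this chunk's KV entries
    prefill_flops: float  # FLOPs needed to recompute the chunk on-device

def stream_time_s(chunk: KVChunk, bandwidth_bps: float) -> float:
    """Estimated time to download the chunk's KV entries from the cloud."""
    return chunk.size_bytes * 8 / bandwidth_bps

def compute_time_s(chunk: KVChunk, device_flops: float) -> float:
    """Estimated time to recompute the chunk's KV entries locally."""
    return chunk.prefill_flops / device_flops

def plan(chunks, bandwidth_bps, device_flops):
    """Assign each chunk to whichever path is estimated to be cheaper.
    The streamed and locally computed chunks then run on parallel paths."""
    return [
        "stream"
        if stream_time_s(c, bandwidth_bps) < compute_time_s(c, device_flops)
        else "compute"
        for c in chunks
    ]
```

For example, a large chunk that is cheap to recompute would be assigned to local computation, while a small chunk from a compute-heavy prefix would be streamed.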

Key facts

  • SparKV is an adaptive KV loading framework for on-device LLM inference.
  • It combines cloud-based KV streaming with on-device computation.
  • It models the cost of each KV chunk to decide between streaming and local computation.
  • Execution paths are overlapped to reduce latency.
  • Runtime refinement of offline schedules handles connectivity and resource fluctuations.
  • Experiments show 1.3x-5.1x reduction in Time-to-First-Token.
  • Response quality impact is negligible.
  • Per-request energy consumption reduced by 1.5x to 3.3x.
