DUAL-BLADE: NVMe-Direct KV-Cache Offloading for Edge LLM Inference
DUAL-BLADE is a dual-path KV residency architecture for edge LLM inference. At runtime it routes each KV tensor to either a page-cache path or an NVMe-direct path based on available memory. The NVMe-direct path bypasses the filesystem by mapping KV tensors to contiguous logical block address (LBA) regions, giving low-overhead access to storage, while adaptive pipeline parallelism overlaps storage I/O with GPU DMA to sustain inference throughput. The design targets edge AI deployments where KV caches exceed device memory and conventional file-based NVMe offloading suffers from cache thrashing and heavy software overhead. The paper is available on arXiv as 2604.26557.
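The paper does not publish its routing policy, but the dual-path idea can be illustrated with a minimal sketch: keep a KV tensor on the page-cache path while free memory stays above a low watermark, and spill it to the NVMe-direct path under memory pressure. All names and thresholds below are assumptions for illustration, not DUAL-BLADE's actual algorithm.

```python
# Hypothetical sketch of a memory-pressure-based dual-path router.
PAGE_CACHE = "page-cache"
NVME_DIRECT = "nvme-direct"

def route_kv_tensor(tensor_bytes: int,
                    free_mem_bytes: int,
                    total_mem_bytes: int = 8 << 30,  # assumed 8 GiB edge device
                    watermark: float = 0.25) -> str:
    """Pick a residency path for one KV tensor.

    If placing the tensor in memory would leave free memory above the
    low watermark, use the page-cache path; otherwise fall back to the
    filesystem-bypassing NVMe-direct path.
    """
    if free_mem_bytes - tensor_bytes > watermark * total_mem_bytes:
        return PAGE_CACHE
    return NVME_DIRECT
```

A real implementation would sample free memory continuously and may also migrate already-resident tensors between paths; this sketch only captures the per-tensor decision point.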
Key facts
- DUAL-BLADE is a dual-path KV residency framework for edge LLM inference.
- It dynamically assigns KV tensors to a page-cache path or NVMe-direct path.
- The NVMe-direct path maps KV tensors to contiguous logical block address regions.
- It bypasses the filesystem for low-overhead direct storage access.
- Adaptive pipeline parallelism overlaps storage I/O with GPU DMA.
- The system targets edge AI systems with tight memory budgets.
- Existing file-based designs rely on kernel page cache, causing cache thrashing.
- Paper is arXiv:2604.26557.
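The NVMe-direct path's mapping of KV tensors to contiguous LBA regions can be sketched as a simple bump allocator: each tensor chunk receives a contiguous run of logical blocks, so reads and writes can go straight to the device (e.g. via O_DIRECT or io_uring) with no filesystem metadata in between. The class and field names here are illustrative assumptions, not taken from the paper.

```python
from dataclasses import dataclass

BLOCK_SIZE = 4096  # assumed NVMe logical block size in bytes


@dataclass
class LbaExtent:
    start_lba: int   # first logical block of the region
    num_blocks: int  # contiguous length in blocks


class KvLbaAllocator:
    """Hypothetical allocator assigning each KV chunk a contiguous LBA extent."""

    def __init__(self) -> None:
        self.next_lba = 0
        # (layer, chunk) -> extent; a flat table replaces filesystem metadata
        self.table: dict[tuple[int, int], LbaExtent] = {}

    def allocate(self, layer: int, chunk: int, nbytes: int) -> LbaExtent:
        blocks = -(-nbytes // BLOCK_SIZE)  # ceil-divide to whole blocks
        extent = LbaExtent(self.next_lba, blocks)
        self.table[(layer, chunk)] = extent
        self.next_lba += blocks
        return extent

    def byte_offset(self, layer: int, chunk: int) -> int:
        """Device byte offset for a direct (block-aligned) read or write."""
        return self.table[(layer, chunk)].start_lba * BLOCK_SIZE
```

Because every extent is contiguous and block-aligned, a KV fetch is a single direct I/O at a computed offset, which is what lets the design avoid per-file overhead and kernel page-cache thrashing.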
Entities
Institutions
- arXiv