DUAL-BLADE: NVMe-Direct KV-Cache Offloading for Edge LLM Inference
DUAL-BLADE is a dual-path KV residency architecture for edge LLM inference. At runtime it routes each KV tensor to either a page-cache path or an NVMe-direct path based on available memory. The NVMe-direct path bypasses the filesystem by mapping KV tensors to contiguous logical block address (LBA) regions, giving low-overhead access to storage, while adaptive pipeline parallelism overlaps storage I/O with GPU DMA to sustain inference throughput. The design targets edge AI deployments where KV caches exceed device memory and conventional file-based NVMe offloading suffers from cache thrashing and heavy software overhead. The paper is available on arXiv as 2604.26557.
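The paper does not publish its routing policy, but the dual-path idea can be illustrated with a minimal sketch: keep a KV tensor on the page-cache path while free memory stays above a low watermark, and spill it to the NVMe-direct path under memory pressure. All names and thresholds below are assumptions for illustration, not DUAL-BLADE's actual algorithm.

```python
# Hypothetical sketch of a memory-pressure-based dual-path router.
PAGE_CACHE = "page-cache"
NVME_DIRECT = "nvme-direct"

def route_kv_tensor(tensor_bytes: int,
                    free_mem_bytes: int,
                    total_mem_bytes: int = 8 << 30,  # assumed 8 GiB edge device
                    watermark: float = 0.25) -> str:
    """Pick a residency path for one KV tensor.

    If placing the tensor in memory would leave free memory above the
    low watermark, use the page-cache path; otherwise fall back to the
    filesystem-bypassing NVMe-direct path.
    """
    if free_mem_bytes - tensor_bytes > watermark * total_mem_bytes:
        return PAGE_CACHE
    return NVME_DIRECT
```

A real implementation would sample free memory continuously and may also migrate already-resident tensors between paths; this sketch only captures the per-tensor decision point.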
Key facts
- DUAL-BLADE is a dual-path KV residency framework for edge LLM inference.
- It dynamically assigns KV tensors to a page-cache path or NVMe-direct path.
- The NVMe-direct path maps KV tensors to contiguous logical block address regions.
- It bypasses the filesystem for low-overhead direct storage access.
- Adaptive pipeline parallelism overlaps storage I/O with GPU DMA.
- The system targets edge AI systems with tight memory budgets.
- Existing file-based designs rely on kernel page cache, causing cache thrashing.
- Paper is arXiv:2604.26557.
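The NVMe-direct path's mapping of KV tensors to contiguous LBA regions can be sketched as a simple bump allocator: each tensor chunk receives a contiguous run of logical blocks, so reads and writes can go straight to the device (e.g. via O_DIRECT or io_uring) with no filesystem metadata in between. The class and field names here are illustrative assumptions, not taken from the paper.

```python
from dataclasses import dataclass

BLOCK_SIZE = 4096  # assumed NVMe logical block size in bytes


@dataclass
class LbaExtent:
    start_lba: int   # first logical block of the region
    num_blocks: int  # contiguous length in blocks


class KvLbaAllocator:
    """Hypothetical allocator assigning each KV chunk a contiguous LBA extent."""

    def __init__(self) -> None:
        self.next_lba = 0
        # (layer, chunk) -> extent; a flat table replaces filesystem metadata
        self.table: dict[tuple[int, int], LbaExtent] = {}

    def allocate(self, layer: int, chunk: int, nbytes: int) -> LbaExtent:
        blocks = -(-nbytes // BLOCK_SIZE)  # ceil-divide to whole blocks
        extent = LbaExtent(self.next_lba, blocks)
        self.table[(layer, chunk)] = extent
        self.next_lba += blocks
        return extent

    def byte_offset(self, layer: int, chunk: int) -> int:
        """Device byte offset for a direct (block-aligned) read or write."""
        return self.table[(layer, chunk)].start_lba * BLOCK_SIZE
```

Because every extent is contiguous and block-aligned, a KV fetch is a single direct I/O at a computed offset, which is what lets the design avoid per-file overhead and kernel page-cache thrashing.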
Entities
Institutions
- arXiv