Predictive Multi-Tier Memory Management for KV Cache in Large-Scale GPU Inference
A novel system tackles the central bottleneck of managing key-value (KV) cache memory for large-scale GPU inference. It addresses three inefficiencies: the lack of unified KV cache sizing across attention architectures (multi-head latent attention, for instance, is unsupported in general-purpose frameworks, causing memory over-provisioning of up to 57x), confinement of the cache to GPU HBM alone despite the available memory hierarchy (CPU DRAM, CXL, NVMe, RDMA, and parallel filesystems), and reactive eviction policies that discard reusable state and force redundant recomputation. An architecture-variant-aware sizing engine computes the exact memory requirement for each attention type, enabling batch sizes up to 7.4x larger.
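As an illustration of what architecture-variant-aware sizing might look like, the Python sketch below estimates KV-cache bytes per token for standard multi-head attention (MHA), grouped-query attention (GQA), and an MLA-style compressed latent. The field names (`num_kv_heads`, `kv_lora_rank`, `qk_rope_head_dim`) and the example dimensions are assumptions chosen for illustration, not parameters taken from the paper.

```python
from dataclasses import dataclass

BYTES_FP16 = 2  # assumed fp16/bf16 cache precision

@dataclass
class ModelConfig:
    num_layers: int
    num_heads: int
    head_dim: int
    num_kv_heads: int = 0        # > 0 for grouped-query attention (GQA)
    kv_lora_rank: int = 0        # > 0 for an MLA-style compressed KV latent
    qk_rope_head_dim: int = 0    # decoupled RoPE key dim cached alongside the latent

def kv_bytes_per_token(cfg: ModelConfig, dtype_bytes: int = BYTES_FP16) -> int:
    """Estimate KV-cache bytes per token, per attention variant."""
    if cfg.kv_lora_rank:                      # MLA: cache compressed latent + RoPE key
        per_layer = cfg.kv_lora_rank + cfg.qk_rope_head_dim
    elif cfg.num_kv_heads:                    # GQA: K and V for the shared KV heads only
        per_layer = 2 * cfg.num_kv_heads * cfg.head_dim
    else:                                     # MHA: full K and V for every head
        per_layer = 2 * cfg.num_heads * cfg.head_dim
    return cfg.num_layers * per_layer * dtype_bytes

# Illustrative configs (hypothetical values, not taken from the paper).
mha = ModelConfig(num_layers=60, num_heads=128, head_dim=128)
mla = ModelConfig(num_layers=60, num_heads=128, head_dim=128,
                  kv_lora_rank=512, qk_rope_head_dim=64)

naive = kv_bytes_per_token(mha)   # what a framework without MLA support would reserve
exact = kv_bytes_per_token(mla)   # what the MLA variant actually needs
print(f"naive={naive} B/token, exact={exact} B/token, ratio={naive / exact:.1f}x")
```

With these illustrative dimensions the naive-to-exact ratio comes out to roughly 57x, which shows how over-provisioning of the cited magnitude can arise when a framework sizes an MLA model as if it cached full multi-head keys and values.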
Key facts
- KV cache memory management is the primary bottleneck in large-scale GPU inference.
- Current systems lack unified KV cache sizing across attention architectures.
- Multi-head latent attention (MLA) is unsupported in general-purpose frameworks, causing up to 57x memory over-provisioning.
- KV cache is confined to a single memory tier (GPU HBM) despite the availability of a deeper memory hierarchy.
- Reactive eviction policies discard reusable state, forcing redundant recomputation; a tier-aware offload sketch follows this list.
- The proposed system addresses all three problems.
- Architecture-variant-aware sizing engine computes exact memory requirements per attention type.
- Enables up to 7.4x higher batch sizes.
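To illustrate why multi-tier spilling matters for the reactive-eviction problem, here is a minimal Python sketch of a tier-aware KV pool that demotes reusable cache blocks down a memory hierarchy instead of discarding them. The class names, tier capacities, and block identifiers are hypothetical; the real system's placement, prefetch, and eviction policies are not described in this summary.

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Tier:
    name: str
    capacity_gb: float
    used_gb: float = 0.0

    def fits(self, size_gb: float) -> bool:
        return self.used_gb + size_gb <= self.capacity_gb

@dataclass
class MultiTierKVPool:
    # Tiers ordered fastest -> slowest, e.g. HBM, DRAM, CXL, NVMe / parallel FS.
    tiers: list
    placement: dict = field(default_factory=dict)   # block_id -> (tier index, size_gb)

    def admit(self, block_id: str, size_gb: float, tier: int = 0) -> None:
        self.tiers[tier].used_gb += size_gb
        self.placement[block_id] = (tier, size_gb)

    def demote(self, block_id: str) -> Optional[str]:
        """Under HBM pressure, spill a reusable block to the next tier that fits it
        instead of discarding it and paying for recomputation later."""
        src, size_gb = self.placement[block_id]
        for dst in range(src + 1, len(self.tiers)):
            if self.tiers[dst].fits(size_gb):
                self.tiers[src].used_gb -= size_gb
                self.tiers[dst].used_gb += size_gb
                self.placement[block_id] = (dst, size_gb)
                return self.tiers[dst].name
        return None   # hierarchy exhausted: only now would the block actually be dropped

pool = MultiTierKVPool(tiers=[Tier("GPU HBM", 80), Tier("CPU DRAM", 512),
                              Tier("CXL", 1024), Tier("NVMe", 4096)])
pool.admit("req42/shared-prefix", 1.5)          # KV block resident in HBM
print(pool.demote("req42/shared-prefix"))       # -> "CPU DRAM" rather than eviction
```

The design point this captures is that dropping KV state should be the last resort once every lower tier is full, because re-decoding an evicted prefix spends GPU compute that offload bandwidth would otherwise avoid.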