ProxyKV: Cross-Model KV Cache Pruning for Long-Context LLMs

ai-technology · 2026-05-20

ProxyKV is a new framework that targets the memory bottleneck of the Key-Value (KV) cache in long-context Large Language Model (LLM) inference. Existing pruning methods rely either on low-latency heuristics that sacrifice accuracy or on high-precision reconstruction that adds prohibitive prefilling overhead. ProxyKV bridges this gap by offloading importance scoring to a lightweight Small-Model Proxy from the same model family, which runs asynchronously with respect to the Large-Model Target. To handle architectural differences between the two models, the authors introduce HybridAxialMapper, which separates temporal feature extraction from cross-head alignment, and a Multi-Granularity Hybrid Loss that shifts the learning objective from regression to relative ranking consistency. The framework was evaluated on the Llama-3.1, Qwen-2.5, and Qwen-3 families, with target sizes from 7B to 32B parameters. The paper is available on arXiv under identifier 2605.16360.
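To make the pruning step concrete, here is a minimal sketch of score-based KV cache pruning, assuming the per-token importance scores have already been produced by a proxy model. The function names (select_keep_indices, prune_kv), the toy scores, and the fixed token budget are illustrative assumptions, not ProxyKV's actual interface; the mapping from proxy to target that HybridAxialMapper performs is not modeled here.

```python
from typing import List, Sequence, Tuple


def select_keep_indices(proxy_scores: Sequence[float], budget: int) -> List[int]:
    """Return positions of the `budget` highest-scoring tokens, in original order."""
    if budget <= 0:
        return []
    budget = min(budget, len(proxy_scores))
    # Rank positions by score, highest first; ties go to the earlier position.
    ranked = sorted(range(len(proxy_scores)), key=lambda i: (-proxy_scores[i], i))
    # Re-sort the survivors by position so the pruned cache keeps token order.
    return sorted(ranked[:budget])


def prune_kv(keys: Sequence, values: Sequence, keep: Sequence[int]) -> Tuple[list, list]:
    """Keep only the cache entries at the selected positions."""
    return [keys[i] for i in keep], [values[i] for i in keep]


# Toy data: six cached tokens with hypothetical proxy-assigned importance scores.
scores = [0.9, 0.1, 0.4, 0.05, 0.7, 0.3]
keys = [f"k{i}" for i in range(6)]
values = [f"v{i}" for i in range(6)]

keep = select_keep_indices(scores, budget=3)
pruned_keys, pruned_values = prune_kv(keys, values, keep)
print(keep)         # [0, 2, 4]
print(pruned_keys)  # ['k0', 'k2', 'k4']
```

Only the relative order of the scores affects which entries survive, which is one reason a ranking-oriented training objective is a natural fit for a scoring proxy.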

Key facts

ProxyKV is a cross-model proxy pruning framework for the KV cache.
It offloads importance scoring to a lightweight Small-Model Proxy.
The proxy runs asynchronously to the Large-Model Target.
HybridAxialMapper separates temporal feature extraction from cross-head alignment.
Multi-Granularity Hybrid Loss targets relative ranking consistency rather than regression.
Evaluated on Llama-3.1, Qwen-2.5, and Qwen-3 families.
Target sizes range from 7B to 32B parameters.
Paper available at arXiv:2605.16360.
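
The shift from regression to relative ranking can be illustrated with a small comparison. The sketch below contrasts mean squared error with a generic pairwise hinge ranking loss; it is not the paper's Multi-Granularity Hybrid Loss, whose exact form is not given in this summary, and the function names and toy scores are assumptions made for illustration.

```python
from typing import Sequence


def mse_loss(pred: Sequence[float], target: Sequence[float]) -> float:
    """Regression objective: penalizes any difference in magnitude."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def pairwise_ranking_loss(pred: Sequence[float], target: Sequence[float],
                          margin: float = 0.0) -> float:
    """Mean hinge penalty over all pairs that the target orders strictly."""
    total, pairs = 0.0, 0
    for i in range(len(pred)):
        for j in range(len(pred)):
            if target[i] > target[j]:
                # Penalize only when pred fails to rank i above j by `margin`.
                total += max(0.0, margin - (pred[i] - pred[j]))
                pairs += 1
    return total / pairs if pairs else 0.0


target = [0.9, 0.1, 0.4, 0.7]      # hypothetical target-model importance
same_order = [9.0, 1.0, 4.0, 7.0]  # different scale, identical ranking
swapped = [0.9, 0.4, 0.1, 0.7]     # positions 1 and 2 ranked the wrong way

print(mse_loss(same_order, target) > 0)           # True: scale mismatch is penalized
print(pairwise_ranking_loss(same_order, target))  # 0.0: ordering is fully preserved
print(pairwise_ranking_loss(swapped, target) > 0) # True: one pair is misordered
```

A proxy that predicts scores on a different scale from the target still selects the same tokens as long as the ordering agrees, so the ranking loss reports zero error where regression does not.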
