ARTFEED — Contemporary Art Intelligence

ProxyKV: Cross-Model KV Cache Pruning for Long-Context LLMs

ai-technology · 2026-05-20

A new framework called ProxyKV targets the memory bottleneck of the Key-Value (KV) cache in long-context Large Language Model (LLM) inference. Existing pruning methods rely either on low-latency heuristics that sacrifice accuracy or on high-precision reconstruction that adds prohibitive prefilling overhead. ProxyKV bridges this gap by offloading importance scoring to a lightweight Small-Model Proxy from the same model family, which runs asynchronously to the Large-Model Target. To handle architectural differences between the two models, the authors introduce HybridAxialMapper, which separates temporal feature extraction from cross-head alignment, and a Multi-Granularity Hybrid Loss, which shifts the learning objective from regression to relative ranking consistency. The framework was evaluated across the Llama-3.1, Qwen-2.5, and Qwen-3 families, with target sizes from 7B to 32B parameters. The paper is available on arXiv under identifier 2605.16360.
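The pruning step that such importance scores feed into can be illustrated with a minimal sketch. This is not the paper's implementation: it assumes the proxy has already produced one importance score per cached token, and it uses plain top-k selection, which is one common way to turn scores into a pruning decision. The names select_tokens, prune_kv, and proxy_scores are hypothetical.

```python
from typing import List, Sequence, Tuple


def select_tokens(scores: Sequence[float], keep: int) -> List[int]:
    """Return indices of the `keep` highest-scoring tokens, in original order."""
    if keep <= 0:
        return []
    # Rank by score (descending); ties go to the earlier position.
    ranked = sorted(range(len(scores)), key=lambda i: (-scores[i], i))
    # Restore positional order so the pruned cache stays sequential.
    return sorted(ranked[:keep])


def prune_kv(
    keys: Sequence[Sequence[float]],
    values: Sequence[Sequence[float]],
    scores: Sequence[float],
    keep: int,
) -> Tuple[List[Sequence[float]], List[Sequence[float]], List[int]]:
    """Keep only the KV entries whose tokens were selected by score."""
    idx = select_tokens(scores, keep)
    return [keys[i] for i in idx], [values[i] for i in idx], idx


# Toy cache: 6 tokens with 2-dimensional keys and values.
keys = [[float(i), 0.0] for i in range(6)]
values = [[0.0, float(i)] for i in range(6)]

# Assumed per-token importance scores, standing in for the proxy's output.
proxy_scores = [0.9, 0.1, 0.4, 0.8, 0.05, 0.7]

pruned_keys, pruned_values, kept = prune_kv(keys, values, proxy_scores, keep=3)
print(kept)  # [0, 3, 5]
```

Only the ordering of the scores affects which tokens survive here, which is why a scoring model can be useful even when its raw score values do not match the target model's.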

Key facts

  • ProxyKV is a cross-model proxy pruning framework for the KV cache.
  • It offloads importance scoring to a lightweight Small-Model Proxy.
  • The proxy runs asynchronously to the Large-Model Target.
  • HybridAxialMapper separates temporal feature extraction from cross-head alignment.
  • Multi-Granularity Hybrid Loss shifts learning from regression to relative ranking consistency.
  • Evaluated on Llama-3.1, Qwen-2.5, and Qwen-3 families.
  • Target sizes range from 7B to 32B parameters.
  • Paper available at arXiv:2605.16360.
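The difference between a regression objective and a relative-ranking objective can be shown with a small sketch. The code below is not the Multi-Granularity Hybrid Loss, whose exact form is not given here; it swaps in a generic pairwise hinge ranking loss as an illustration, and the function names, the margin value, and the toy scores are all assumptions.

```python
from typing import Sequence


def mse_loss(pred: Sequence[float], target: Sequence[float]) -> float:
    """Regression objective: penalizes any mismatch in score values."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def pairwise_ranking_loss(
    pred: Sequence[float], target: Sequence[float], margin: float = 0.1
) -> float:
    """Generic pairwise hinge loss: penalizes only mis-ordered pairs."""
    total, pairs = 0.0, 0
    n = len(pred)
    for i in range(n):
        for j in range(n):
            # Consider pairs where the target ranks token i above token j.
            if target[i] > target[j]:
                total += max(0.0, margin - (pred[i] - pred[j]))
                pairs += 1
    return total / pairs if pairs else 0.0


# Assumed target importance scores for four tokens.
target = [0.9, 0.1, 0.4, 0.8]
# Same ordering as the target, but on a different scale.
scaled = [9.0, 1.0, 4.0, 8.0]
# Tokens 0 and 1 swapped relative to the target ordering.
swapped = [0.1, 0.9, 0.4, 0.8]

print(mse_loss(scaled, target))                # large: values differ
print(pairwise_ranking_loss(scaled, target))   # zero: ordering matches
print(pairwise_ranking_loss(swapped, target))  # positive: ordering broken
```

A prediction on the wrong scale is heavily penalized by the regression loss but not by the ranking loss, as long as the token ordering is preserved. That property is relevant when the scores come from a model with a different architecture than the target.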

Entities

Institutions

  • arXiv
