OjaKV Framework Introduces Online Adaptation for KV Cache Compression in Large Language Models
A new framework called OjaKV addresses a key memory bottleneck in large language models by compressing the key-value (KV) cache needed for autoregressive generation. The motivation is concrete: Llama-3.1-8B requires approximately 16GB for its KV cache when processing 32K-token prompts at batch size 4, more than the model's own weights occupy. Traditional low-rank projection methods compress this cache using subspaces learned offline, but they perform poorly when the data distribution shifts away from the calibration set. OjaKV instead combines a hybrid storage policy with online subspace adaptation: the first and most recent tokens are preserved at full rank, maintaining high-fidelity anchors for the attention mechanism, while the projection subspace used for the remaining tokens is updated continually. By strategically determining which tokens to compress, OjaKV enables more efficient long-context processing while maintaining model accuracy. The framework is detailed in arXiv preprint 2509.21623v2 (announcement type: replace-cross).
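The hybrid policy described above can be sketched in a few lines. Note this is an illustrative sketch only: keeping the first and most recent tokens at full rank follows the description, but the Oja-style update rule (suggested by the framework's name), the rank, and the window sizes are assumptions, not details taken from the paper.

```python
import numpy as np

class HybridKVCache:
    """Sketch: full-rank anchors for first/recent tokens, rank-r
    coefficients for the rest, with an online Oja-style subspace update."""

    def __init__(self, dim, rank, n_first=4, n_recent=64, lr=0.01):
        self.rank, self.n_first, self.n_recent, self.lr = rank, n_first, n_recent, lr
        # Random orthonormal basis, adapted online as key vectors stream in
        self.U = np.linalg.qr(np.random.default_rng(0).standard_normal((dim, rank)))[0]
        self.full = []        # full-rank anchors (first + recent tokens)
        self.compressed = []  # rank-r coefficient vectors for middle tokens

    def _oja_step(self, x):
        # Oja's rule: rotate U toward the incoming vector, then
        # re-orthonormalize so U remains a valid projection basis.
        y = self.U.T @ x
        self.U, _ = np.linalg.qr(self.U + self.lr * np.outer(x - self.U @ y, y))

    def append(self, k):
        self._oja_step(k)   # adapt the subspace online
        self.full.append(k) # the newest token is always stored at full rank
        # Once the recent window overflows, demote the oldest non-anchor
        # token to the low-rank store (keep only r coefficients).
        if len(self.full) > self.n_first + self.n_recent:
            victim = self.full.pop(self.n_first)
            self.compressed.append(self.U.T @ victim)

# Minimal usage: stream 200 synthetic key vectors through the cache
rng = np.random.default_rng(1)
cache = HybridKVCache(dim=64, rank=8)
for _ in range(200):
    cache.append(rng.standard_normal(64))
```

The design choice mirrored here is that the first tokens (attention "sinks") and the recent window carry disproportionate attention mass, so they are the ones worth keeping lossless.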
Key facts
- OjaKV is a novel framework for key-value cache compression in large language models
- It uses a hybrid storage policy combined with online subspace adaptation
- The framework preserves first and most recent tokens at full rank as attention anchors
- Llama-3.1-8B requires approximately 16GB for KV cache with 32K-token prompts at batch size 4
- This KV cache size exceeds the model's own weight storage requirements
- Existing compression methods rely on static, offline-learned subspaces
- Static methods perform poorly under data distribution shifts
- The research was published as arXiv preprint 2509.21623v2 (announcement type: replace-cross)
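The 16GB figure in the list above can be reproduced from Llama-3.1-8B's published architecture (32 transformer layers, 8 grouped-query KV heads, head dimension 128), assuming 2-byte fp16 storage:

```python
# Back-of-envelope check of the KV cache size quoted for Llama-3.1-8B.
layers, kv_heads, head_dim = 32, 8, 128
seq_len, batch, bytes_per_elem = 32_768, 4, 2  # 32K tokens, batch 4, fp16
# Keys and values each store layers * kv_heads * head_dim values per token.
kv_bytes = 2 * layers * kv_heads * head_dim * seq_len * batch * bytes_per_elem
print(kv_bytes / 2**30)  # → 16.0 (GiB)
```

This matches the article's claim that the cache exceeds the model's ~16GB of fp16 weights at this sequence length and batch size.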