OjaKV Framework Introduces Online Adaptation for KV Cache Compression in Large Language Models
A new framework called OjaKV addresses a key memory bottleneck in large language models by compressing the key-value (KV) cache needed for autoregressive generation. The motivation is concrete: Llama-3.1-8B requires approximately 16GB for its KV cache when processing 32K-token prompts at batch size 4, more than the model's own weights occupy. Traditional low-rank projection methods compress this cache using subspaces learned offline, but they perform poorly when the data distribution shifts away from the calibration set. OjaKV instead combines a hybrid storage policy with online subspace adaptation: the first and most recent tokens are preserved at full rank, maintaining high-fidelity anchors for the attention mechanism, while the projection subspace used for the remaining tokens is updated continually. By strategically determining which tokens to compress, OjaKV enables more efficient long-context processing while maintaining model accuracy. The framework is detailed in arXiv preprint 2509.21623v2 (announcement type: replace-cross).
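The hybrid policy described above can be sketched in a few lines. Note this is an illustrative sketch only: keeping the first and most recent tokens at full rank follows the description, but the Oja-style update rule (suggested by the framework's name), the rank, and the window sizes are assumptions, not details taken from the paper.

```python
import numpy as np

class HybridKVCache:
    """Sketch: full-rank anchors for first/recent tokens, rank-r
    coefficients for the rest, with an online Oja-style subspace update."""

    def __init__(self, dim, rank, n_first=4, n_recent=64, lr=0.01):
        self.rank, self.n_first, self.n_recent, self.lr = rank, n_first, n_recent, lr
        # Random orthonormal basis, adapted online as key vectors stream in
        self.U = np.linalg.qr(np.random.default_rng(0).standard_normal((dim, rank)))[0]
        self.full = []        # full-rank anchors (first + recent tokens)
        self.compressed = []  # rank-r coefficient vectors for middle tokens

    def _oja_step(self, x):
        # Oja's rule: rotate U toward the incoming vector, then
        # re-orthonormalize so U remains a valid projection basis.
        y = self.U.T @ x
        self.U, _ = np.linalg.qr(self.U + self.lr * np.outer(x - self.U @ y, y))

    def append(self, k):
        self._oja_step(k)   # adapt the subspace online
        self.full.append(k) # the newest token is always stored at full rank
        # Once the recent window overflows, demote the oldest non-anchor
        # token to the low-rank store (keep only r coefficients).
        if len(self.full) > self.n_first + self.n_recent:
            victim = self.full.pop(self.n_first)
            self.compressed.append(self.U.T @ victim)

# Minimal usage: stream 200 synthetic key vectors through the cache
rng = np.random.default_rng(1)
cache = HybridKVCache(dim=64, rank=8)
for _ in range(200):
    cache.append(rng.standard_normal(64))
```

The design choice mirrored here is that the first tokens (attention "sinks") and the recent window carry disproportionate attention mass, so they are the ones worth keeping lossless.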
Key facts
- OjaKV is a novel framework for key-value cache compression in large language models
- It uses a hybrid storage policy combined with online subspace adaptation
- The framework preserves first and most recent tokens at full rank as attention anchors
- Llama-3.1-8B requires approximately 16GB for KV cache with 32K-token prompts at batch size 4
- This KV cache size exceeds the model's own weight storage requirements
- Existing compression methods rely on static, offline-learned subspaces
- Static methods perform poorly under data distribution shifts
- The research was published as arXiv preprint 2509.21623v2 (announcement type: replace-cross)
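The 16GB figure in the list above can be reproduced from Llama-3.1-8B's published architecture (32 transformer layers, 8 grouped-query KV heads, head dimension 128), assuming 2-byte fp16 storage:

```python
# Back-of-envelope check of the KV cache size quoted for Llama-3.1-8B.
layers, kv_heads, head_dim = 32, 8, 128
seq_len, batch, bytes_per_elem = 32_768, 4, 2  # 32K tokens, batch 4, fp16
# Keys and values each store layers * kv_heads * head_dim values per token.
kv_bytes = 2 * layers * kv_heads * head_dim * seq_len * batch * bytes_per_elem
print(kv_bytes / 2**30)  # → 16.0 (GiB)
```

This matches the article's claim that the cache exceeds the model's ~16GB of fp16 weights at this sequence length and batch size.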