KVBuffer: IO-Aware Serving for Linear Attention

other · 2026-05-20

The paper arXiv:2605.19049 presents KVBuffer, a mechanism for IO-aware serving of linear attention. Linear attention is increasingly preferred for long-context inference because its decoding cost is constant with respect to context length. However, existing serving systems recurrently compute and update a large linear attention state at every decoding step. Because that state is much larger than a single token's key and value, recurrent decoding incurs substantial memory access. KVBuffer addresses this by buffering recent keys and values, which makes the computation more flexible. It enables chunkwise computation for decoding: state updates are deferred and applied in batch, which reduces average memory access and latency. KVBuffer also verifies draft tokens in parallel during speculative decoding, addressing a major limitation in serving linear attention models.
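The deferred-update idea can be illustrated with a minimal NumPy sketch. It is not the paper's implementation: it assumes the simplest unnormalized form of linear attention (state S = sum of outer products k v^T, output o = q S), and the function names and the `chunk_size` parameter are hypothetical. The baseline rewrites the full state at every step, while the buffered variant keeps recent keys and values in a small buffer and folds them into the state once per chunk.

```python
import numpy as np

def recurrent_decode(qs, ks, vs):
    """Baseline: read and write the full (d_k x d_v) state at every step."""
    state = np.zeros((qs.shape[1], vs.shape[1]))
    outputs = []
    for q, k, v in zip(qs, ks, vs):
        state += np.outer(k, v)          # per-step state update
        outputs.append(q @ state)
    return np.stack(outputs)

def buffered_decode(qs, ks, vs, chunk_size=4):
    """Sketch: buffer recent keys/values and defer the state update."""
    state = np.zeros((qs.shape[1], vs.shape[1]))
    k_buf, v_buf = [], []
    outputs = []
    for q, k, v in zip(qs, ks, vs):
        k_buf.append(k)
        v_buf.append(v)
        K, V = np.stack(k_buf), np.stack(v_buf)
        # Contribution of older tokens comes from the (stale) state,
        # contribution of buffered tokens is computed directly.
        outputs.append(q @ state + (K @ q) @ V)
        if len(k_buf) == chunk_size:
            state += K.T @ V             # one batched state write per chunk
            k_buf.clear()
            v_buf.clear()
    return np.stack(outputs)

# Both paths produce the same outputs on random inputs.
rng = np.random.default_rng(0)
qs, ks, vs = rng.standard_normal((3, 10, 8))
assert np.allclose(recurrent_decode(qs, ks, vs), buffered_decode(qs, ks, vs))
```

In this sketch the state is written once every `chunk_size` steps instead of once per step, at the cost of a small extra computation over the buffered keys and values. How the paper chooses the chunk size or lays out the buffer is not stated in this summary.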

Key facts

Linear attention has constant decoding cost with respect to context length.
Existing serving systems recurrently compute and update a large linear attention state.
The state is much larger than a single token's key and value.
Recurrent decoding incurs substantial memory access.
KVBuffer buffers recent keys and values.
KVBuffer enables chunkwise computation for decoding.
Chunkwise computation defers state updates and applies them in batch.
KVBuffer verifies draft tokens in parallel for speculative decoding.
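The last point, parallel verification of draft tokens, can be sketched in the same spirit. This is an illustrative NumPy sketch under the same assumption of unnormalized linear attention, not the paper's method; the function names are hypothetical and the accept/reject logic of speculative decoding is left out. Given a frozen state, a buffer of recent keys and values, and the queries, keys, and values of several draft tokens, the outputs at all draft positions are computed in one masked matrix product instead of one state update per token.

```python
import numpy as np

def verify_draft_parallel(state, k_buf, v_buf, q_draft, k_draft, v_draft):
    """Compute outputs for all draft positions at once, without touching state."""
    b, m = len(k_buf), len(q_draft)
    K = np.concatenate([k_buf, k_draft])     # (b + m, d_k)
    V = np.concatenate([v_buf, v_draft])     # (b + m, d_v)
    # Draft token i sees every buffered token and draft tokens 0..i.
    mask = np.tril(np.ones((m, b + m)), k=b)
    scores = (q_draft @ K.T) * mask
    return q_draft @ state + scores @ V

def verify_draft_sequential(state, k_buf, v_buf, q_draft, k_draft, v_draft):
    """Reference: one recurrent state update per draft token."""
    s = state + k_buf.T @ v_buf              # fold the buffer into a copy
    outputs = []
    for q, k, v in zip(q_draft, k_draft, v_draft):
        s = s + np.outer(k, v)
        outputs.append(q @ s)
    return np.stack(outputs)

rng = np.random.default_rng(0)
d_k, d_v = 8, 8
state = rng.standard_normal((d_k, d_v))
k_buf, v_buf = rng.standard_normal((3, d_k)), rng.standard_normal((3, d_v))
q_d, k_d = rng.standard_normal((2, 5, d_k))
v_d = rng.standard_normal((5, d_v))
assert np.allclose(
    verify_draft_parallel(state, k_buf, v_buf, q_d, k_d, v_d),
    verify_draft_sequential(state, k_buf, v_buf, q_d, k_d, v_d),
)
```

Because the parallel path never modifies the state, rejected draft tokens can simply be dropped, and only the accepted keys and values need to be kept in the buffer.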

Entities

—

Sources

arXiv cs.AI — 2026-05-20