SPEED: Layer-Asymmetric KV Cache for Efficient Long-Context LLM Inference
Researchers have unveiled Shallow Prefill, dEEp Decode (SPEED), a phase-asymmetric KV-visibility strategy for decoder-only language models. Unlike previous methods that aim to reduce the storage or construction cost of upper-layer prompt KV states, SPEED materializes non-anchor prompt-token KV states only in the lower layers during the Prefill phase, excluding prefill tokens from the upper-layer Decode visibility set entirely, while decode-phase tokens retain full-depth KV visibility. In a controlled instruction-tuning study on Llama-3.1-8B, SPEED with prefill tokens materialized in just 75% of layers averaged 51.2 on OLMES-style benchmarks, slightly below the full-depth baseline's 51.4, while improving time to first token (TTFT) by 33% and also improving time per output token (TPOT) by an unspecified amount. The approach lowers long-context inference costs with negligible quality degradation.
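The mechanics can be sketched as a per-layer visibility policy. The snippet below is a minimal illustration, not the authors' implementation: the layer count, the 0.75 prefill-layer fraction, and the choice of which prompt positions count as anchors are all assumed placeholders.

```python
# Minimal sketch of a SPEED-style phase-asymmetric KV cache.
# Assumptions (not from the source): num_layers=32, a 0.75 prefill-layer
# fraction, and treating a small fixed set of prompt positions as "anchors".
from dataclasses import dataclass, field


@dataclass
class SpeedKVCache:
    num_layers: int
    prefill_layer_frac: float = 0.75           # share of layers that keep all prompt KV
    anchors: set = field(default_factory=set)  # prompt positions kept at full depth
    cache: dict = field(default_factory=dict, init=False)  # layer -> cached positions

    def __post_init__(self):
        self.cache = {layer: set() for layer in range(self.num_layers)}
        self.shallow_depth = int(self.num_layers * self.prefill_layer_frac)

    def prefill(self, prompt_positions):
        """Materialize prompt-token KV states: every position in the lower
        (shallow) layers, but only anchor positions in the upper layers."""
        for layer in range(self.num_layers):
            for pos in prompt_positions:
                if layer < self.shallow_depth or pos in self.anchors:
                    self.cache[layer].add(pos)

    def decode_step(self, new_position):
        """Decode-phase tokens get full-depth KV states."""
        for layer in range(self.num_layers):
            self.cache[layer].add(new_position)

    def visible(self, layer):
        """Positions a decode-phase query can attend to at this layer; in the
        upper layers, non-anchor prefill tokens were never materialized, so
        they are excluded from the visibility set by construction."""
        return sorted(self.cache[layer])


# Usage: an 8-token prompt with positions 0 and 7 as illustrative anchors.
kv = SpeedKVCache(num_layers=32, anchors={0, 7})
kv.prefill(range(8))
kv.decode_step(8)       # first generated token at position 8
print(kv.visible(0))    # lower layer: all prompt tokens + the decode token
print(kv.visible(31))   # upper layer: anchors {0, 7} + the decode token only
```

The point the sketch captures is that non-anchor prompt tokens never receive upper-layer KV entries, so upper-layer decode attention cannot see them, while decode tokens are cached at every layer.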
Key facts
- SPEED is a phase-asymmetric KV-visibility policy for decoder-only language models.
- It materializes non-anchor prompt-token KV states only in lower layers during Prefill.
- Decode-phase tokens retain full-depth KV visibility.
- SPEED removes prefill tokens from the upper-layer Decode visibility set.
- Evaluated in a controlled instruction-tuning study on Llama-3.1-8B.
- With prefill tokens materialized in only 75% of layers, SPEED averaged 51.2 on OLMES-style benchmarks.
- Full-depth baseline scored 51.4 on the same benchmarks.
- SPEED improved TTFT by 33%.
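A rough consistency check on the latency figure, under the assumption (not stated in the source) that prefill compute is roughly linear in the number of layers that process prompt tokens: running prompt tokens through 75% of the layers cuts prefill compute to about 0.75x of full depth, roughly a 1.33x speedup.

```python
# Back-of-envelope only; assumes prefill compute is linear in the number of
# layers that materialize prompt-token KV states.
layer_fraction = 0.75            # share of layers used for prefill tokens
compute_ratio = layer_fraction   # prefill FLOPs relative to full depth
speedup = 1.0 / compute_ratio    # ~1.33x under the linear-scaling assumption
print(f"Prefill compute: {compute_ratio:.2f}x, estimated speedup: {speedup:.2f}x")
```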