SPEED: Layer-Asymmetric KV Cache for Efficient Long-Context LLM Inference
Researchers have unveiled Shallow Prefill, dEEp Decode (SPEED), a phase-asymmetric KV-visibility strategy for decoder-only language models. Unlike previous methods that aim to reduce the storage or construction cost of upper-layer prompt KV states, SPEED materializes non-anchor prompt-token KV states only in the lower layers during the Prefill phase, excluding prefill tokens from the upper-layer Decode visibility set entirely, while decode-phase tokens retain full-depth KV visibility. In a controlled instruction-tuning study on Llama-3.1-8B, SPEED with prefill tokens materialized in just 75% of layers averaged 51.2 on OLMES-style benchmarks, slightly below the full-depth baseline's 51.4, while improving time to first token (TTFT) by 33% and also improving time per output token (TPOT) by an unspecified amount. The approach lowers long-context inference costs with negligible quality degradation.
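The mechanics can be sketched as a per-layer visibility policy. The snippet below is a minimal illustration, not the authors' implementation: the layer count, the 0.75 prefill-layer fraction, and the choice of which prompt positions count as anchors are all assumed placeholders.

```python
# Minimal sketch of a SPEED-style phase-asymmetric KV cache.
# Assumptions (not from the source): num_layers=32, a 0.75 prefill-layer
# fraction, and treating a small fixed set of prompt positions as "anchors".
from dataclasses import dataclass, field


@dataclass
class SpeedKVCache:
    num_layers: int
    prefill_layer_frac: float = 0.75           # share of layers that keep all prompt KV
    anchors: set = field(default_factory=set)  # prompt positions kept at full depth
    cache: dict = field(default_factory=dict, init=False)  # layer -> cached positions

    def __post_init__(self):
        self.cache = {layer: set() for layer in range(self.num_layers)}
        self.shallow_depth = int(self.num_layers * self.prefill_layer_frac)

    def prefill(self, prompt_positions):
        """Materialize prompt-token KV states: every position in the lower
        (shallow) layers, but only anchor positions in the upper layers."""
        for layer in range(self.num_layers):
            for pos in prompt_positions:
                if layer < self.shallow_depth or pos in self.anchors:
                    self.cache[layer].add(pos)

    def decode_step(self, new_position):
        """Decode-phase tokens get full-depth KV states."""
        for layer in range(self.num_layers):
            self.cache[layer].add(new_position)

    def visible(self, layer):
        """Positions a decode-phase query can attend to at this layer; in the
        upper layers, non-anchor prefill tokens were never materialized, so
        they are excluded from the visibility set by construction."""
        return sorted(self.cache[layer])


# Usage: an 8-token prompt with positions 0 and 7 as illustrative anchors.
kv = SpeedKVCache(num_layers=32, anchors={0, 7})
kv.prefill(range(8))
kv.decode_step(8)       # first generated token at position 8
print(kv.visible(0))    # lower layer: all prompt tokens + the decode token
print(kv.visible(31))   # upper layer: anchors {0, 7} + the decode token only
```

The point the sketch captures is that non-anchor prompt tokens never receive upper-layer KV entries, so upper-layer decode attention cannot see them, while decode tokens are cached at every layer.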
Key facts
- SPEED is a phase-asymmetric KV-visibility policy for decoder-only language models.
- It materializes non-anchor prompt-token KV states only in lower layers during Prefill.
- Decode-phase tokens retain full-depth KV visibility.
- SPEED removes prefill tokens from the upper-layer Decode visibility set.
- Evaluated in a controlled instruction-tuning study on Llama-3.1-8B.
- With prefill tokens materialized in only 75% of layers, SPEED averaged 51.2 on OLMES-style benchmarks.
- Full-depth baseline scored 51.4 on the same benchmarks.
- SPEED improved TTFT by 33%.
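A rough consistency check on the latency figure, under the assumption (not stated in the source) that prefill compute is roughly linear in the number of layers that process prompt tokens: running prompt tokens through 75% of the layers cuts prefill compute to about 0.75x of full depth, roughly a 1.33x speedup.

```python
# Back-of-envelope only; assumes prefill compute is linear in the number of
# layers that materialize prompt-token KV states.
layer_fraction = 0.75            # share of layers used for prefill tokens
compute_ratio = layer_fraction   # prefill FLOPs relative to full depth
speedup = 1.0 / compute_ratio    # ~1.33x under the linear-scaling assumption
print(f"Prefill compute: {compute_ratio:.2f}x, estimated speedup: {speedup:.2f}x")
```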