ARTFEED — Contemporary Art Intelligence

SPEED: Layer-Asymmetric KV Cache for Efficient Long-Context LLM Inference

ai-technology · 2026-05-09

Researchers have unveiled Shallow Prefill, dEEp Decode (SPEED), a phase-asymmetric KV-visibility strategy for decoder-only language models. Unlike previous methods that aim to reduce the storage or construction cost of upper-layer prompt KV states, SPEED materializes non-anchor prompt-token KV states only in the lower layers during the prefill phase, excluding prefill tokens entirely from the upper-layer decode visibility set; decode-phase tokens retain full-depth KV visibility. In a controlled study on an instruction-tuned Llama-3.1-8B, SPEED used just 75% of layers for prefill tokens and scored an average of 51.2 on OLMES-style benchmarks, marginally below the full-depth baseline's 51.4, while improving time to first token (TTFT) by 33%; time per output token (TPOT) also improved, though by an unreported margin. The method cuts long-context inference costs with negligible quality degradation.
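The visibility rule described above can be sketched as a simple per-layer mask policy. The Python sketch below is illustrative only: the function name, the `prefill_layer_frac` parameter, and the anchor/non-anchor flags are assumptions for exposition, not the paper's implementation.

```python
def kv_visible(layer_idx, num_layers, token_is_anchor, token_is_prefill,
               prefill_layer_frac=0.75):
    """Illustrative SPEED-style policy: is a cached token's KV state
    attendable at `layer_idx` during decode? (Sketch, not the paper's code.)"""
    lower_layers = int(num_layers * prefill_layer_frac)  # e.g. 24 of 32 layers
    if not token_is_prefill:
        return True  # decode-phase tokens keep full-depth KV visibility
    if token_is_anchor:
        return True  # anchor prompt tokens remain visible at every layer
    # Non-anchor prompt tokens exist only in the lower layers, so
    # upper-layer decode attention never sees them.
    return layer_idx < lower_layers

# Example: a non-anchor prompt token is invisible at layer 30 of 32
print(kv_visible(30, 32, token_is_anchor=False, token_is_prefill=True))  # False
```

Under this policy, upper-layer attention during decode runs over a much shorter key set for long prompts, which is where the TTFT/TPOT savings would come from.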

Key facts

  • SPEED is a phase-asymmetric KV-visibility policy for decoder-only language models.
  • It materializes non-anchor prompt-token KV states only in lower layers during Prefill.
  • Decode-phase tokens retain full-depth KV visibility.
  • SPEED removes prefill tokens from the upper-layer Decode visibility set.
  • Tested on an instruction-tuned Llama-3.1-8B.
  • Using 75% of layers for prefill tokens, SPEED scored 51.2 on OLMES-style benchmarks.
  • Full-depth baseline scored 51.4 on the same benchmarks.
  • SPEED improved TTFT by 33%.
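A back-of-envelope calculation shows the prompt KV-cache footprint implied by prefilling only 75% of layers. The sketch below assumes Llama-3.1-8B's published configuration (32 layers, 8 grouped-query KV heads, head dimension 128) and an fp16 cache; the article itself reports latency gains, not memory figures, so the numbers here are illustrative arithmetic only.

```python
def prompt_kv_bytes(num_tokens, layers_used, num_layers=32,
                    kv_heads=8, head_dim=128, dtype_bytes=2):
    """Bytes of prompt KV cache: K and V tensors per materialized layer.
    Defaults reflect Llama-3.1-8B's config with an fp16 cache (assumption)."""
    per_token_per_layer = 2 * kv_heads * head_dim * dtype_bytes  # K + V
    return num_tokens * layers_used * per_token_per_layer

full = prompt_kv_bytes(8192, 32)   # full-depth prefill
speed = prompt_kv_bytes(8192, 24)  # 75% of layers, SPEED-style
print(full / 2**20, speed / 2**20)  # 1024.0 768.0 (MiB)
```

For an 8K-token prompt this works out to a 25% smaller non-anchor prompt KV footprint, directly proportional to the fraction of layers skipped.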
