SpecKV: Adaptive Speculative Decoding for LLMs
A team of researchers has introduced SpecKV, an adaptive controller that selects the speculation length gamma at each step of speculative decoding for large language models. Speculative decoding uses a smaller draft model to propose candidate tokens, which a larger target model then verifies in parallel, accelerating inference. The optimal gamma depends on both the task type and the degree of compression applied to the target model. The study profiled speculative decoding across four task categories, four speculation lengths, and three compression levels (FP16, INT8, NF4), collecting 5,112 step-level records that include per-step acceptance rates, draft entropy, and draft confidence. By using these draft-model signals to adjust gamma dynamically, SpecKV outperforms fixed-gamma baselines.
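The propose-then-verify loop described above can be sketched in a few lines. This is a toy illustration, not SpecKV's implementation: `draft_next` and `target_next` stand in for greedy next-token calls to real models, and the step returns the extended context plus the fraction of draft tokens the target accepted.

```python
def speculative_step(draft_next, target_next, context, gamma):
    """One speculative decoding step (toy sketch).

    The draft model proposes gamma tokens autoregressively; the target
    model verifies them left to right, keeping the longest agreed prefix
    and emitting its own token at the first disagreement (or one bonus
    token when every draft token is accepted).
    """
    # Draft phase: propose gamma candidate tokens.
    proposed, ctx = [], list(context)
    for _ in range(gamma):
        tok = draft_next(ctx)
        proposed.append(tok)
        ctx.append(tok)

    # Verification phase: target checks each proposal in order.
    accepted, matched, ctx = [], 0, list(context)
    for tok in proposed:
        t = target_next(ctx)
        if t == tok:
            accepted.append(tok)
            ctx.append(tok)
            matched += 1
        else:
            accepted.append(t)  # target's correction ends the step
            break
    else:
        accepted.append(target_next(ctx))  # bonus token: all accepted

    return context + accepted, matched / gamma
```

With a perfectly aligned draft, a gamma of 4 yields 5 tokens per step (4 accepted plus the bonus token); with a draft that always disagrees, each step yields only the target's single correction, which is why picking gamma well matters.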
Key facts
- SpecKV is an adaptive controller for selecting speculation length gamma in speculative decoding.
- Speculative decoding uses a small draft model to propose candidate tokens for a larger target model.
- Gamma determines how many tokens the draft model proposes per step.
- Optimal gamma varies across task types and compression levels.
- Study profiled 4 task categories, 4 speculation lengths, 3 compression levels.
- 5,112 step-level records were collected.
- Records include per-step acceptance rates, draft entropy, and draft confidence.
- SpecKV uses signals from the draft model to adjust gamma dynamically.
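The last point can be made concrete with a minimal controller sketch. The thresholds and the mapping from signals to gamma below are illustrative assumptions, not SpecKV's published rule; it only shows how draft confidence and entropy could drive the choice of speculation length.

```python
import math

def choose_gamma(probs, gamma_options=(1, 2, 4, 8),
                 conf_threshold=0.6, entropy_threshold=1.5):
    """Pick a speculation length from the draft model's next-token
    distribution (hypothetical heuristic).

    A confident, low-entropy draft distribution suggests the target is
    likely to accept many tokens, so we speculate further; an uncertain
    draft suggests short speculation to avoid wasted verification work.
    """
    confidence = max(probs)
    entropy = -sum(p * math.log(p) for p in probs if p > 0)
    if confidence >= conf_threshold and entropy <= entropy_threshold:
        return gamma_options[-1]   # confident draft: longest gamma
    if confidence >= conf_threshold / 2:
        return gamma_options[len(gamma_options) // 2]  # middle ground
    return gamma_options[0]        # uncertain draft: shortest gamma
```

Feeding this controller the per-step draft entropy and confidence signals named above would adjust gamma step by step, in contrast to a fixed-gamma schedule.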