D-PACE: Dynamic Loss for Parallel Speculative Decoding in LLMs
D-PACE (Dynamic Position-Aware Cross-Entropy) is a training loss for the drafter models used in speculative decoding with large language models (LLMs). Speculative decoding speeds up inference by having a smaller drafter model propose tokens, which a larger target model then verifies in parallel. Recent diffusion-based drafters such as DFlash can predict an entire block of tokens in a single forward pass, but current multi-token objectives use static position-dependent weights that stay fixed throughout training. D-PACE instead derives a weight for each position from a differentiable approximation of the expected accepted draft length, directing the training signal toward the positions that currently limit acceptance. In tests with Qwen3-4B drafter models across six benchmarks, the authors report improved acceptance rates and faster inference. The paper is available on arXiv under identifier 2605.18810.
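The summary above does not give the paper's actual formula, so the following is only an illustrative sketch of how position weights could be derived from a surrogate of expected accepted length. It assumes that position i is accepted independently with probability p_i (approximated by the probability the drafter assigns to the target's token), so the expected accepted length is the sum of prefix products of p. The function names, the normalization, and the choice to weight by the gradient with respect to log p_i are all assumptions made for illustration, not the authors' method.

```python
import numpy as np

def expected_accept_length(p):
    """Surrogate E[L] = sum_k prod_{j<=k} p_j.

    Assumption: each position is accepted independently with probability p_j,
    and acceptance stops at the first rejected position.
    """
    return float(np.cumprod(p).sum())

def position_weights(p):
    """Hypothetical dynamic weights, proportional to dE[L]/d(log p_i).

    Every prefix product that includes position i contributes to the gradient,
    so the gradient is the suffix sum of the prefix products.
    """
    prefix = np.cumprod(p)
    grad = np.cumsum(prefix[::-1])[::-1]  # suffix sums of prefix products
    return grad / grad.sum()              # normalize so the weights sum to 1

def weighted_cross_entropy(p, w):
    """Position-weighted cross-entropy, with the weights treated as constants."""
    return float(-(w * np.log(p)).sum())

# Made-up per-position acceptance probabilities for a 3-token draft block.
p = np.array([0.9, 0.5, 0.8])
w = position_weights(p)
loss = weighted_cross_entropy(p, w)
```

Under these assumptions, a weak drafter concentrates weight on the earliest positions, and the weight on later positions grows as the per-position probabilities rise, which is in the spirit of the shift the summary describes. In a real training loop the weights would be recomputed each step from the current drafter outputs and excluded from backpropagation.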
Key facts
- D-PACE is a dynamic position-aware cross-entropy loss for speculative decoding.
- It addresses fixed weighting schedules in multi-token drafter objectives.
- Weights are derived from a differentiable surrogate of expected accepted draft length.
- Tested across six benchmarks with Qwen3-4B drafter models.
- Published on arXiv with ID 2605.18810.
- Related to diffusion-based parallel drafters like DFlash.
- Aims to accelerate LLM inference by improving token acceptance rates.
- Training signal shifts toward positions limiting acceptance as drafter improves.
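To make "accepted draft length" concrete, here is a minimal sketch of the quantity under a simplified greedy verification rule: a draft token counts as accepted only if it matches the target model's token at the same position, and everything after the first mismatch is discarded. The function name and the token ids are hypothetical, and sampling-based speculative decoding uses a probabilistic accept/reject rule instead of exact matching.

```python
def accepted_length(draft_tokens, target_tokens):
    """Length of the longest prefix of the draft that matches the target tokens.

    Simplification: greedy verification by exact token match.
    """
    n = 0
    for d, t in zip(draft_tokens, target_tokens):
        if d != t:
            break  # the first mismatch ends acceptance for this block
        n += 1
    return n

# Made-up token ids: the draft diverges from the target at the third position.
draft = [5, 9, 2, 7]
target = [5, 9, 4, 7]
n_accepted = accepted_length(draft, target)  # 2
```

Because a single early mismatch discards every later draft token, the positions that fail first cap the speedup, which is why a loss might reasonably focus on them.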
Entities
Institutions
- arXiv