PAD-Rec: Accelerating LLM-Based Generative Recommendation
A new method called PAD-Rec (Position-Aware Drafting for generative Recommendation) improves inference speed in large language model (LLM)-based generative list-wise recommendation. The technique addresses limitations of standard speculative decoding (SD), in which a small draft model proposes multiple tokens and a target LLM verifies them in parallel. In recommendation tasks, items are represented by semantic-ID tokens separated by delimiter tokens, so a token's semantics depend on its position within an item slot. Prediction uncertainty also grows with speculation depth, since later drafted tokens condition on earlier, as-yet-unverified ones. PAD-Rec augments the draft model with position-aware signals to account for both factors, achieving greater speedups without altering the target distribution. The work is published on arXiv under ID 2604.27747.
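To make the "without altering the target distribution" guarantee concrete, the following is a minimal sketch of the standard SD accept/reject rule that PAD-Rec builds on (function and variable names are illustrative, not from the paper): each drafted token is accepted with probability min(1, p/q), and the first rejection triggers a resample from the renormalized residual max(0, p − q), which provably preserves the target model's distribution.

```python
import numpy as np

rng = np.random.default_rng(0)

def verify_draft(p_target, q_draft, drafted):
    """Speculative-decoding verification (hypothetical helper).

    p_target, q_draft: per-position probability vectors from the target
    and draft models. drafted: token ids proposed by the draft model.
    Each token is kept with prob min(1, p[tok]/q[tok]); on the first
    rejection we resample from the residual max(0, p - q), renormalized,
    and stop. The accepted sequence is distributed as if sampled from
    the target model alone.
    """
    accepted = []
    for i, tok in enumerate(drafted):
        p, q = p_target[i], q_draft[i]
        if rng.random() < min(1.0, p[tok] / q[tok]):
            accepted.append(tok)
        else:
            residual = np.maximum(p - q, 0.0)
            residual /= residual.sum()
            accepted.append(int(rng.choice(len(p), p=residual)))
            break
    return accepted
```

Under this rule, improving the draft model (as PAD-Rec does with position-aware signals) raises the acceptance rate and thus the speedup, while verification keeps outputs exact.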
Key facts
- PAD-Rec is a position-aware drafting module for generative recommendation.
- It accelerates inference in LLM-based list-wise recommendation.
- Standard speculative decoding treats tokens uniformly, ignoring position-dependent semantics.
- PAD-Rec models token slot position and uncertainty growth with depth.
- The method does not change the target distribution.
- It is designed for generative recommendation using semantic-ID tokens.
- The paper is available on arXiv with ID 2604.27747.
- The approach aims to reduce latency in sequential decoding.
Entities
Institutions
- arXiv