PAD-Rec: Accelerating LLM-Based Generative Recommendation
A new method called PAD-Rec (Position-Aware Drafting for generative Recommendation) improves inference speed in large language model (LLM)-based generative list-wise recommendation. The technique addresses limitations of standard speculative decoding (SD), in which a small draft model proposes multiple tokens and a target LLM verifies them in parallel. In recommendation tasks, items are represented by semantic-ID tokens separated by delimiter tokens, so a token's semantics depend on its position within an item slot. Prediction uncertainty also grows with speculation depth, since later drafted tokens condition on earlier, as-yet-unverified ones. PAD-Rec augments the draft model with position-aware signals to account for both factors, achieving greater speedups without altering the target distribution. The work is published on arXiv under ID 2604.27747.
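To make the "without altering the target distribution" guarantee concrete, the following is a minimal sketch of the standard SD accept/reject rule that PAD-Rec builds on (function and variable names are illustrative, not from the paper): each drafted token is accepted with probability min(1, p/q), and the first rejection triggers a resample from the renormalized residual max(0, p − q), which provably preserves the target model's distribution.

```python
import numpy as np

rng = np.random.default_rng(0)

def verify_draft(p_target, q_draft, drafted):
    """Speculative-decoding verification (hypothetical helper).

    p_target, q_draft: per-position probability vectors from the target
    and draft models. drafted: token ids proposed by the draft model.
    Each token is kept with prob min(1, p[tok]/q[tok]); on the first
    rejection we resample from the residual max(0, p - q), renormalized,
    and stop. The accepted sequence is distributed as if sampled from
    the target model alone.
    """
    accepted = []
    for i, tok in enumerate(drafted):
        p, q = p_target[i], q_draft[i]
        if rng.random() < min(1.0, p[tok] / q[tok]):
            accepted.append(tok)
        else:
            residual = np.maximum(p - q, 0.0)
            residual /= residual.sum()
            accepted.append(int(rng.choice(len(p), p=residual)))
            break
    return accepted
```

Under this rule, improving the draft model (as PAD-Rec does with position-aware signals) raises the acceptance rate and thus the speedup, while verification keeps outputs exact.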
Key facts
- PAD-Rec is a position-aware drafting module for generative recommendation.
- It accelerates inference in LLM-based list-wise recommendation.
- Standard speculative decoding treats tokens uniformly, ignoring position-dependent semantics.
- PAD-Rec models token slot position and uncertainty growth with depth.
- The method does not change the target distribution.
- It is designed for generative recommendation using semantic-ID tokens.
- The paper is available on arXiv with ID 2604.27747.
- The approach aims to reduce latency in sequential decoding.
Entities
Institutions
- arXiv