SAPO: Step-Aligned Policy Optimization for Reasoning-Based Generative Recommendation

publication · 2026-05-20

A recent arXiv paper (2605.17648) presents SAPO (Step-Aligned Policy Optimization), a method for improving reasoning-based generative recommendation. Generative recommendation casts next-item prediction as autoregressive generation of item identifiers, with each item represented by a semantic identifier (SID), a short token sequence. Prior work has added reasoning traces and optimized them with reinforcement learning, using outcome-reward algorithms that give exact-match feedback on the generated SIDs. With large catalogs, however, such feedback only indicates whether the final item is correct; it does not reveal which SID-token prediction caused the mismatch. The authors argue that credit should instead be assigned at the level of individual reasoning steps, with rewards aligned to those steps. The paper is a preprint and has not yet undergone peer review.
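To illustrate the credit-assignment problem described above, here is a minimal sketch. It is not taken from the paper: the function names, the four-token SID length, and the token values are all hypothetical, and the per-token comparison is only a diagnostic to show what exact-match feedback hides, not SAPO's actual reward.

```python
from typing import List, Tuple

# Assumption: a SID is a short, fixed-length tuple of integer tokens.
SID = Tuple[int, ...]


def outcome_reward(predicted: SID, target: SID) -> float:
    """Exact-match outcome reward: 1.0 only when every SID token is correct."""
    return 1.0 if predicted == target else 0.0


def per_token_matches(predicted: SID, target: SID) -> List[bool]:
    """Hypothetical diagnostic: which SID positions agree with the target.

    Assumes both SIDs have the same length.
    """
    return [p == t for p, t in zip(predicted, target)]


# Made-up example values for illustration only.
target = (12, 7, 203, 5)
near_miss = (12, 7, 203, 9)   # wrong only in the last token
far_miss = (88, 1, 40, 9)     # wrong in every token

for name, sid in [("near_miss", near_miss), ("far_miss", far_miss)]:
    print(name, outcome_reward(sid, target), per_token_matches(sid, target))
```

Both predictions receive the same outcome reward of 0.0, even though one is wrong in a single token and the other in all four. That is the gap in outcome-only feedback that motivates finer-grained, step-level credit assignment.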

Key facts

Paper arXiv:2605.17648 introduces SAPO for generative recommendation.
SAPO stands for Step-Aligned Policy Optimization.
Generative recommendation uses semantic identifiers (SIDs) as token sequences.
Outcome rewards with exact-match feedback cannot pinpoint which SID token was mispredicted in large catalogs.
SAPO assigns step-level rewards to individual reasoning steps.
The paper is a preprint and not peer-reviewed.
Published on arXiv in 2026.
Authors are not named in the provided content.
