Credit-Assigned Policy Gradient Improves Early-Stage Retrieval in Two-Stage Ranking
A new reinforcement learning method, credit-assigned policy gradient (CA-PG), addresses the difficulty of scaling the training of early-stage rankers (ESRs) in the two-stage ranking pipelines behind large-scale search, recommendation, and retrieval-augmented generation (RAG) systems. Vanilla policy gradient (V-PG) suffers from exploding variance at the candidate-set sizes needed in practice, because it propagates gradients to the joint probability of the whole candidate set rather than to individual items. CA-PG instead computes gradients with respect to the marginal probability that a target item is chosen for the candidate set, which reduces variance. The approach is detailed in arXiv:2605.26385v1.
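The variance argument can be illustrated with a toy simulation. The sketch below is not the paper's estimator or experimental setup. It assumes a hypothetical ESR that includes each item in the candidate set independently with probability sigmoid(theta_i), and a toy reward of 1 when the target item is retrieved. The function name `compare_estimators` and all parameters are illustrative. Under these assumptions it compares a score-function estimator built on the joint log-probability of the whole set with one built only on the target item's marginal inclusion probability.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def compare_estimators(num_items, num_samples=5000, target=0, seed=0):
    """Toy comparison of joint vs. marginal score-function gradient estimators.

    Assumption (not from the paper): items enter the candidate set
    independently, so log p(S) factorizes over items.
    """
    rng = np.random.default_rng(seed)
    theta = np.zeros(num_items)              # hypothetical ESR logits, one per item
    p = sigmoid(theta)                       # per-item inclusion probabilities
    z = (rng.random((num_samples, num_items)) < p).astype(float)
    reward = z[:, target]                    # toy reward: 1 if the target was retrieved

    # Joint estimator: reward * grad_theta log p(S).
    # For independent Bernoulli inclusions, grad_theta log p(S) = z - p.
    g_joint = reward[:, None] * (z - p)

    # Marginal estimator: reward * grad_theta log p(target in S),
    # which is nonzero only in the target's own coordinate.
    g_marginal = np.zeros_like(z)
    g_marginal[:, target] = reward * (1.0 - p[target])

    # Exact gradient of the expected reward p[target] w.r.t. theta.
    true_grad = np.zeros(num_items)
    true_grad[target] = p[target] * (1.0 - p[target])

    return {
        "true_grad": true_grad,
        "mean_joint": g_joint.mean(axis=0),
        "mean_marginal": g_marginal.mean(axis=0),
        "var_joint": g_joint.var(axis=0).sum(),       # total variance over coordinates
        "var_marginal": g_marginal.var(axis=0).sum(),
    }


results = {n: compare_estimators(n) for n in (10, 50, 250)}
for n, r in results.items():
    print(f"items={n:4d}  var_joint={r['var_joint']:.3f}  var_marginal={r['var_marginal']:.3f}")
```

In this toy setting both estimators are unbiased for the same gradient, but the joint estimator picks up noise from every item unrelated to the target, so its total variance grows roughly linearly with the number of items, while the marginal estimator's variance stays constant.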
Key facts
- Two-stage ranking systems consist of an early-stage ranker (ESR) and a late-stage ranker (LSR).
- ESR generates a candidate set; LSR re-ranks it.
- Vanilla policy gradient (V-PG) is not scalable for practical candidate-set sizes due to exploding variance.
- V-PG propagates gradients to the joint probability of the candidate set, ignoring item-level contributions.
- Credit-assigned policy gradient (CA-PG) computes gradients with respect to the marginal probability of the target item being chosen.
- CA-PG mitigates variance issues in ESR training.
- The method is applicable to search, recommendation, and RAG systems.
- The paper is available on arXiv with ID 2605.26385.
Entities
Institutions
- arXiv