Credit-Assigned Policy Gradient Improves Early-Stage Retrieval in Two-Stage Ranking

other · 2026-05-27

A new reinforcement learning method, credit-assigned policy gradient (CA-PG), targets the scalability problem of training the early-stage ranker (ESR) in two-stage ranking systems, which underpin large-scale search, recommendation, and retrieval-augmented generation (RAG). Vanilla policy gradient (V-PG), the standard approach, suffers from exploding variance at candidate-set sizes of practical interest because it propagates gradients to the joint probability of the whole candidate set rather than to individual items. CA-PG instead computes gradients with respect to the marginal probability that a target item appears in the candidate set, which reduces variance. The approach is detailed in arXiv:2605.26385v1.

Key facts

Two-stage ranking systems consist of an early-stage ranker (ESR) and a late-stage ranker (LSR).
ESR generates a candidate set; LSR re-ranks it.
Vanilla policy gradient (V-PG) is not scalable for practical candidate-set sizes due to exploding variance.
V-PG propagates the gradient to the joint probability of the candidate set, ignoring item-level contributions.
Credit-assigned policy gradient (CA-PG) computes gradients with respect to the marginal probability of a target item being chosen.
CA-PG mitigates variance issues in ESR training.
The method is applicable to search, recommendation, and RAG systems.
The paper is available on arXiv with ID 2605.26385.
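
The variance argument above can be illustrated with a toy sketch. This is not the paper's algorithm. It assumes a hypothetical ESR that includes each item in the candidate set independently with probability sigmoid(logit), and a hypothetical reward of 1 whenever the target item makes it into the set. Under those assumptions, a score-function estimator built on the joint probability of the set is unbiased but picks up noise from every non-target item, while the gradient of the target's marginal inclusion probability touches only the target's logit. All function names here are illustrative.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def joint_score_gradient(logits, target, rng):
    """One score-function sample: reward * grad log p(S) for the whole set S."""
    probs = [sigmoid(t) for t in logits]
    # Assumption: item i enters the candidate set independently with prob p_i.
    included = [1.0 if rng.random() < p else 0.0 for p in probs]
    # Assumption: toy reward is 1 if the target item made the candidate set.
    reward = included[target]
    # For independent Bernoulli inclusion, d log p(S) / d logit_i = s_i - p_i.
    return [reward * (s - p) for s, p in zip(included, probs)]


def marginal_gradient(logits, target):
    """Exact gradient of P(target in S) = p_target with respect to the logits."""
    p = sigmoid(logits[target])
    grad = [0.0] * len(logits)
    grad[target] = p * (1.0 - p)
    return grad


def total_variance(samples):
    """Sum over coordinates of the per-coordinate sample variance."""
    n = len(samples)
    total = 0.0
    for j in range(len(samples[0])):
        mean = sum(s[j] for s in samples) / n
        total += sum((s[j] - mean) ** 2 for s in samples) / n
    return total


def compare(num_items, num_samples=2000, seed=0):
    """Return (joint-estimator variance, its mean on the target logit, exact marginal gradient)."""
    rng = random.Random(seed)
    logits = [0.0] * num_items  # every item starts with inclusion probability 0.5
    target = 0
    samples = [joint_score_gradient(logits, target, rng) for _ in range(num_samples)]
    mean_target = sum(s[target] for s in samples) / num_samples
    return total_variance(samples), mean_target, marginal_gradient(logits, target)[target]


if __name__ == "__main__":
    for n in (10, 100):
        var, mean_target, exact = compare(n)
        print(f"items={n}  joint-estimator variance={var:.3f}  "
              f"mean target grad={mean_target:.3f}  marginal grad={exact:.3f}")
```

In this toy setting both approaches agree in expectation on the target logit, but the total variance of the joint-probability estimator grows roughly linearly with the number of items, which mirrors the scalability concern described for V-PG.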
