ABPO Framework Tackles Bandit Feedback in Continual LLM Recommender Updates

ai-technology · 2026-05-20

To address exposure bias and feedback ambiguity in the continual updating of generative LLM-based recommenders (LLM-Rec), researchers have introduced Anchored Bandit Policy Optimization (ABPO). Post-deployment logs provide only policy-shaped contextual bandit feedback: outcomes are recorded only for items that a previous serving policy actually exposed, so the signal is incomplete and skewed toward that policy's choices. ABPO combines group-relative policy optimization (GRPO) with explicit treatment of these biases. It includes the exposed recommendation as a logged anchor in each GRPO rollout group, which calibrates group-relative normalization against what the prior policy exposed. The paper is available on arXiv under the identifier 2605.18899.
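The anchoring idea can be illustrated with a minimal sketch. It assumes the standard GRPO normalization (reward minus group mean, divided by group standard deviation) and simply appends the logged item's reward to the rollout group. The function names, the reward values, and the way rewards for sampled rollouts are obtained are all hypothetical. The summary above does not give ABPO's actual objective, so this should not be read as the paper's formula.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style normalization within one group: (r - mean) / (std + eps)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std over the group
    return [(r - mu) / (sigma + eps) for r in rewards]


def anchored_group_advantages(rollout_rewards, anchor_reward, eps=1e-8):
    """Hypothetical anchored variant (an assumption, not the paper's method).

    The logged, actually exposed recommendation is added to the group, so the
    group mean and std are computed relative to what the prior policy showed.
    Returns (advantages of sampled rollouts, advantage of the anchor).
    """
    group = list(rollout_rewards) + [anchor_reward]
    advantages = group_relative_advantages(group, eps)
    return advantages[:-1], advantages[-1]


# Made-up numbers: three sampled recommendations with placeholder reward
# estimates, plus one logged item that the user responded to (reward 1.0).
rollout_rewards = [0.2, 0.5, 0.1]
anchor_reward = 1.0
rollout_adv, anchor_adv = anchored_group_advantages(rollout_rewards, anchor_reward)
```

In this toy setup the anchor's observed reward shifts the group baseline, so sampled rollouts are scored relative to an outcome that was really logged rather than only relative to each other.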

Key facts

Generative LLM-based recommenders require continual post-deployment updates.
Deployment logs provide policy-shaped contextual bandit feedback.
Feedback includes exposure bias and ambiguous no-responses.
The ABPO framework combines GRPO with explicit treatment of these biases.
The exposed recommendation is used as a logged anchor in each GRPO rollout group.
Group-relative normalization is calibrated against the prior policy's exposure.
The paper is available as arXiv:2605.18899.
The arXiv announcement type is "cross" (a cross-listing).
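
To make "policy-shaped contextual bandit feedback" concrete, here is a small sketch of what one log record might look like. The `LoggedInteraction` type, its field names, and the item identifiers are hypothetical and not taken from the paper. The point it illustrates is that an outcome exists only for the item the prior policy exposed, and that a missing response is ambiguous rather than a clear negative.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class LoggedInteraction:
    """One hypothetical deployment-log record (illustrative schema only)."""
    user_context: str   # prompt or history given to the recommender
    exposed_item: str   # the single item the prior serving policy showed
    responded: bool     # False is ambiguous: disliked, or simply not noticed


def observed_outcome(log: LoggedInteraction, candidate_item: str) -> Optional[bool]:
    """Return the logged outcome for a candidate, or None if it was never exposed.

    None means "unknown", not "negative": the log carries no information
    about items the prior policy did not show.
    """
    if candidate_item != log.exposed_item:
        return None
    return log.responded


log = LoggedInteraction(user_context="user_42 history", exposed_item="item_A", responded=True)
outcome_exposed = observed_outcome(log, "item_A")    # outcome was logged
outcome_unexposed = observed_outcome(log, "item_B")  # never shown, so unknown
```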
