ARTFEED — Contemporary Art Intelligence

ABPO Framework Tackles Bandit Feedback in Continual LLM Recommender Updates

ai-technology · 2026-05-20

Researchers have introduced Anchored Bandit Policy Optimization (ABPO), a framework that addresses exposure bias and feedback ambiguity in the continual updating of generative LLM-based recommenders (LLM-Rec). Post-deployment logs supply only policy-shaped contextual bandit feedback: outcomes are observed solely for items that a previous serving policy chose to expose, so the signal is both incomplete and biased. ABPO combines group-relative policy optimization (GRPO) with explicit treatment of these biases. It inserts the exposed recommendation as a logged anchor in each GRPO rollout group, so that group-relative normalization is calibrated against what the prior policy actually exposed. The paper is available on arXiv under the identifier 2605.18899.
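
The anchoring idea can be illustrated with a minimal sketch. It assumes the standard GRPO normalization (subtract the group mean, divide by the group standard deviation) and simply appends the logged item's observed reward to the group. The function name, the reward values, and the use of estimated rewards for sampled rollouts are hypothetical; the exact calibration used in the paper is not described in this summary and may differ.

```python
import numpy as np

def anchored_group_advantages(rollout_rewards, anchor_reward, eps=1e-8):
    """Group-relative advantages for a group that also contains the logged anchor.

    rollout_rewards: estimated rewards for items sampled from the current
        policy (hypothetical scores, e.g. from a reward model).
    anchor_reward: observed reward of the item the prior serving policy exposed.
    Returns (advantages of the sampled rollouts, advantage of the anchor).
    """
    # Assumption: the anchor joins the group as one extra member.
    rewards = np.append(np.asarray(rollout_rewards, dtype=float), float(anchor_reward))
    # Standard GRPO-style normalization over the whole group, anchor included.
    centered = rewards - rewards.mean()
    advantages = centered / (rewards.std() + eps)
    return advantages[:-1], advantages[-1]

# Illustrative values only: three sampled rollouts plus one logged anchor.
rollout_adv, anchor_adv = anchored_group_advantages([0.2, 0.5, 0.1], anchor_reward=1.0)
```

Because the anchor contributes to the group mean and standard deviation, each sampled rollout is scored relative to an outcome that was actually observed under the prior policy, rather than only relative to other unobserved samples.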

Key facts

  • Generative LLM-based recommenders require continual post-deployment updates.
  • Deployment logs provide policy-shaped contextual bandit feedback.
  • This feedback suffers from exposure bias and ambiguous no-responses.
  • ABPO framework combines GRPO with explicit bias treatment.
  • Exposed recommendation is used as a logged anchor in GRPO rollouts.
  • Group-relative normalization is calibrated against prior policy exposure.
  • Paper available on arXiv:2605.18899.
  • arXiv announce type: cross (cross-listed).
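
To make the bandit-feedback points above concrete, here is a small sketch of what one logged interaction might look like. The record layout, field names, and example values are hypothetical and are not taken from the paper; the sketch only shows why outcomes exist for the exposed item alone and why a missing response is ambiguous.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class LoggedInteraction:
    context: str              # user/session context seen by the serving policy
    exposed_item: str         # the item the prior policy chose to show
    outcome: Optional[float]  # observed response; None when the user gave no response

def observed_reward(log: LoggedInteraction, candidate: str) -> Optional[float]:
    """Return the logged outcome for `candidate`, or None if it is unobserved."""
    if candidate != log.exposed_item:
        # Never exposed, so no feedback exists for this item (exposure bias).
        return None
    # May itself be None: a no-response is ambiguous, not a clear negative.
    return log.outcome

# Hypothetical log entry: item_A was shown and received a positive response.
log = LoggedInteraction(context="user_42", exposed_item="item_A", outcome=1.0)
silent = LoggedInteraction(context="user_43", exposed_item="item_C", outcome=None)
```

Any item other than `exposed_item` returns `None` here, which is the gap that an anchored update has to work around.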

Entities

Institutions

  • arXiv

Sources