New Algorithms for Dueling Bandits with Delayed Feedback
Researchers have formally defined the problem of Contextual Dueling Bandits with Stochastic Delayed Feedback, addressing a practical gap in preference-based decision-making systems such as recommender systems and LLM alignment. Standard dueling bandit algorithms assume that preference feedback arrives immediately, an assumption that often fails in settings such as prompt optimization. Delayed feedback can introduce bias, and because dueling bandit estimators have no closed-form solution, existing weighting methods cannot simply be adapted to correct it. To address this, the authors propose two algorithms: Linear Dueling Bandits with Delayed Feedback (LDB-DF) and Neural Dueling Bandits with Delayed Feedback (NDB-DF). Both rely on a new estimator that builds an Inverse Probability Weighting (IPW) mechanism into the loss function, which corrects for the bias caused by delayed or missing feedback. The paper is available on arXiv under the identifier 2605.26554.
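To make the idea of placing IPW inside the loss function concrete, here is a minimal sketch in Python. It is not the paper's estimator. It assumes a linear Bradley-Terry preference model, L2 regularization, and a known probability `obs_prob` that each duel's feedback has arrived by now. The names `ipw_dueling_loss`, `ipw_dueling_grad`, and the tuple layout of `duels` are hypothetical. Each observed duel's log-loss is divided by its observation probability, and pending duels contribute nothing.

```python
import math

def log_sigmoid(z):
    # Numerically stable log(1 / (1 + exp(-z))).
    return -math.log1p(math.exp(-z)) if z >= 0 else z - math.log1p(math.exp(z))

def ipw_dueling_loss(theta, duels, reg=1.0):
    """IPW-weighted Bradley-Terry negative log-likelihood (illustrative only).

    duels: list of (x_a, x_b, pref, observed, obs_prob) tuples, where
      x_a, x_b  are feature vectors of the two compared arms,
      pref      is 1 if arm a was preferred and 0 otherwise,
      observed  says whether the feedback has arrived yet,
      obs_prob  is the ASSUMED-known probability that it has arrived.
    """
    total = 0.5 * reg * sum(t * t for t in theta)
    for x_a, x_b, pref, observed, obs_prob in duels:
        if not observed:
            continue  # pending feedback contributes nothing
        z = sum(t * (a - b) for t, a, b in zip(theta, x_a, x_b))
        nll = -(pref * log_sigmoid(z) + (1 - pref) * log_sigmoid(-z))
        total += nll / obs_prob  # inverse probability weight
    return total

def ipw_dueling_grad(theta, duels, reg=1.0):
    """Gradient of ipw_dueling_loss with respect to theta."""
    grad = [reg * t for t in theta]
    for x_a, x_b, pref, observed, obs_prob in duels:
        if not observed:
            continue
        diff = [a - b for a, b in zip(x_a, x_b)]
        z = sum(t * d for t, d in zip(theta, diff))
        p_a = math.exp(log_sigmoid(z))  # model probability that a beats b
        scale = (p_a - pref) / obs_prob
        for i, d in enumerate(diff):
            grad[i] += scale * d
    return grad

# Toy usage: two observed duels and one whose feedback is still pending.
duels = [
    ([1.0, 0.0], [0.0, 1.0], 1, True, 0.8),
    ([0.5, 0.5], [1.0, 0.0], 0, True, 0.5),
    ([0.0, 1.0], [1.0, 1.0], 1, False, 0.5),
]
theta = [0.0, 0.0]
for _ in range(200):  # plain gradient descent, since no closed form exists
    g = ipw_dueling_grad(theta, duels)
    theta = [t - 0.1 * gi for t, gi in zip(theta, g)]
```

Because the preference likelihood is logistic, the minimizer has no closed form and must be found iteratively, which is why the weight is applied inside the loss rather than to a closed-form estimate. In practice the observation probability would have to come from a model of the delay distribution, which this sketch takes as given.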
Key facts
- Contextual dueling bandits are used in recommender systems and LLM alignment.
- Standard algorithms assume immediate feedback, which is often violated.
- Delayed feedback can introduce bias, and the lack of closed-form estimators prevents simple adaptation of existing weighting methods.
- Two new algorithms proposed: LDB-DF and NDB-DF.
- A novel IPW-based estimator corrects for delayed or missing feedback.
- The problem is formalized as Contextual Dueling Bandits with Stochastic Delayed Feedback.
- The research is available on arXiv (2605.26554).
- The work targets a theoretical challenge specific to the dueling bandit setting.
Entities
Institutions
- arXiv