ODRPO: Ordinal Decompositions for Robust Policy Optimization in LLM Alignment
A new method called Ordinal Decomposition for Robust Policy Optimization (ODRPO) addresses reward noise in Reinforcement Learning from AI Feedback (RLAIF) for Large Language Models (LLMs). RLAIF uses LLM-based auto-raters to provide multi-tier discrete rewards (e.g., 1-10 rubrics) for non-verifiable domains such as long-form question answering and open-ended instruction following. However, these auto-raters are inherently stochastic due to prompt sensitivity and sampling randomness, which can corrupt the advantage estimates used by methods such as GRPO and MaxRL: noisy reward samples skew the normalization statistics and degrade the global learning signal. Sampling multiple reward judgments and taking a majority vote reduces noise, but it is computationally expensive. ODRPO instead decomposes ordinal rewards to improve robustness without that extra computation. The paper is available on arXiv under ID 2605.12667.
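To make the failure mode concrete, the sketch below (illustrative only, not the paper's code) computes GRPO-style group-normalized advantages and shows how a single misfired auto-rater score shifts the group mean and standard deviation, and with them the advantages of every response in the group. The function name and the toy scores are hypothetical.

```python
# Minimal sketch of GRPO-style group-normalized advantages, illustrating
# how one noisy auto-rater score skews the statistics shared by the group.
import numpy as np

def group_normalized_advantages(rewards, eps=1e-8):
    """GRPO-style advantage: (r - group mean) / group std."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)

clean = [7, 7, 8, 6]   # auto-rater scores on a 1-10 rubric
noisy = [7, 7, 8, 1]   # same group, but one score corrupted by rater noise

print(group_normalized_advantages(clean))
print(group_normalized_advantages(noisy))
# The single outlier in `noisy` drags down the mean and inflates the std,
# so the advantages of the three unchanged responses also shift.
```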
Key facts
- ODRPO stands for Ordinal Decomposition for Robust Policy Optimization.
- The method targets reward noise in Reinforcement Learning from AI Feedback (RLAIF).
- RLAIF is used for non-verifiable domains like long-form question answering.
- Auto-raters provide multi-tier discrete rewards (e.g., 1-10 rubrics).
- Stochasticity arises from prompt sensitivity and sampling randomness.
- Noisy rewards corrupt the advantage estimates used by methods such as GRPO and MaxRL.
- Majority voting over repeated reward samples reduces noise but is computationally expensive (an illustrative ordinal decomposition is sketched after this list).
- The paper is on arXiv with ID 2605.12667.
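The paper's exact decomposition is not detailed in this summary. As an assumption-laden sketch, one standard way to decompose an ordinal score is to expand a 1-10 tier into binary threshold indicators 1[score >= t]; the function name and threshold scheme below are illustrative, not ODRPO's definition.

```python
# One common ordinal decomposition (a standard ordinal-regression trick;
# the paper's formulation may differ): map a 1..num_tiers score to binary
# indicators 1[score >= t] for thresholds t = 2..num_tiers.
def ordinal_to_thresholds(score, num_tiers=10):
    """Decompose a rubric score into per-threshold binary indicators."""
    return [1 if score >= t else 0 for t in range(2, num_tiers + 1)]

print(ordinal_to_thresholds(7))  # [1, 1, 1, 1, 1, 1, 0, 0, 0]
print(ordinal_to_thresholds(2))  # [1, 0, 0, 0, 0, 0, 0, 0, 0]
```

One plausible intuition, again an assumption rather than a claim from the paper, is that each binary indicator is coarser and thus less sensitive to a one-tier rating error than the raw score, so aggregating per-threshold signals can be more robust than normalizing the raw tiers directly.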
Entities
Institutions
- arXiv