OPPO: Bayesian Token-Level Credit Assignment for LLM Reasoning
Oracle-Prompted Policy Optimization (OPPO) is a reinforcement learning method for large language models that targets token-level credit assignment in reasoning. GRPO assigns a single trajectory-level advantage to every token; OPPO instead derives per-token signals from a Bayesian update of the model's belief about eventual success. It accumulates oracle signals along a trajectory to estimate the success probability at each position in closed form, at the cost of only one extra forward pass. The method improves on prior distillation-style techniques by combining local, per-token discrimination with trajectory-level evidence.
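For contrast, the GRPO baseline described above can be sketched in a few lines. This is a minimal illustration, not code from the paper: it assumes the commonly used group-normalized form (reward minus group mean, divided by group standard deviation). The function name `grpo_token_advantages`, the `eps` stabilizer, and the choice of population standard deviation are all assumptions made for the example.

```python
from statistics import mean, pstdev


def grpo_token_advantages(rewards, lengths, eps=1e-8):
    """Broadcast one group-normalized advantage per trajectory to all its tokens.

    rewards: one scalar reward per sampled trajectory in the group.
    lengths: number of generated tokens in each trajectory.
    eps and the population std are illustrative choices, not from the source.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    per_trajectory = [(r - mu) / (sigma + eps) for r in rewards]
    # Every token in a trajectory receives the same value: no token-level credit.
    return [[a] * n for a, n in zip(per_trajectory, lengths)]


if __name__ == "__main__":
    advantages = grpo_token_advantages([1.0, 0.0, 1.0, 0.0], [3, 2, 4, 1])
    for row in advantages:
        print(row)
```

Each row of the result is constant, which is the limitation that a per-token method such as OPPO is meant to address.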
Key facts
- OPPO is proposed for token-level credit assignment in LLM reasoning.
- GRPO assigns a single trajectory-level advantage to every token.
- Prior critic-free methods use oracle-conditioned likelihood ratios for per-token signals.
- OPPO uses a Bayesian update of the model's belief about eventual success.
- The method accumulates oracle signals along a trajectory.
- It estimates success probability at every position in closed form.
- OPPO requires one extra forward pass.
- The approach combines local discrimination with trajectory-level evidence.
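The facts above do not give OPPO's exact formula, so the following is only a hedged sketch of how such an accumulation could look. It assumes the oracle-conditioned token probability stands in for p(token | success), so that Bayes' rule gives P(success | y_1..y_t) = P(success) * prod_s [p_oracle(y_s) / p_base(y_s)]. The function names, the `prior` argument, the clipping at 1.0, and the use of belief increments as a per-token signal are all hypothetical and not taken from the paper.

```python
import math


def success_beliefs(prior, logp_oracle, logp_base):
    """Running estimate of P(success | tokens so far) at each position.

    prior:       assumed P(success) before any token is generated, in (0, 1].
    logp_oracle: per-token log-probs from one extra, oracle-prompted forward pass.
    logp_base:   per-token log-probs under the ordinary (unprompted) policy.
    """
    beliefs = []
    log_belief = math.log(prior)
    for lo, lb in zip(logp_oracle, logp_base):
        # Accumulate the log-likelihood ratio along the trajectory.
        log_belief += lo - lb
        # Clipping is an illustrative guard: the oracle-prompted model is only
        # an approximation of the true success-conditioned distribution.
        beliefs.append(min(1.0, math.exp(log_belief)))
    return beliefs


def belief_increments(prior, beliefs):
    """Hypothetical per-token signal: change in estimated success probability."""
    increments = []
    previous = prior
    for b in beliefs:
        increments.append(b - previous)
        previous = b
    return increments


if __name__ == "__main__":
    lo = [math.log(0.6), math.log(0.2)]
    lb = [math.log(0.5), math.log(0.4)]
    beliefs = success_beliefs(0.5, lo, lb)
    print(beliefs, belief_increments(0.5, beliefs))
```

In this sketch, a token the oracle-prompted model finds more likely than the base policy does raises the running belief, and a less likely one lowers it. Each position therefore reflects both its local ratio and the evidence accumulated before it.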
Entities
Institutions
- arXiv