OPPO: Bayesian Token-Level Credit Assignment for LLM Reasoning

other · 2026-05-23

A new reinforcement learning method for large language models, Oracle-Prompted Policy Optimization (OPPO), addresses token-level credit assignment in LLM reasoning. Whereas GRPO assigns a single trajectory-level advantage to every token, OPPO derives per-token signals from a Bayesian update of the model's belief about eventual success. It accumulates oracle signals along a trajectory to estimate the success probability at every position in closed form, at the cost of one extra forward pass. The method improves on prior distillation-style techniques by combining local discrimination with trajectory-level evidence.
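To make the GRPO baseline concrete, here is a minimal sketch of trajectory-level credit: each sampled response gets one group-normalized advantage, and that same value is copied to every token. The function name, the example rewards, and the use of the population standard deviation are assumptions for illustration, not details taken from the OPPO work.

```python
from statistics import mean, pstdev

def grpo_token_advantages(rewards, lengths, eps=1e-8):
    """Group-normalize one scalar reward per trajectory, then copy the
    resulting advantage to every token of that trajectory.

    rewards: one scalar reward per sampled response in the group
    lengths: number of tokens in each response
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # assumption: population std; implementations vary
    per_trajectory = [(r - mu) / (sigma + eps) for r in rewards]
    # Every token in a response shares that response's advantage.
    return [[a] * n for a, n in zip(per_trajectory, lengths)]

# Hypothetical group of four responses with binary correctness rewards.
advantages = grpo_token_advantages(rewards=[1.0, 0.0, 0.0, 1.0],
                                   lengths=[3, 2, 4, 1])
```

Within a response the advantage is constant, so a decisive reasoning step and a filler token receive identical credit. That is the gap a per-token signal is meant to close.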

Key facts

OPPO is proposed for token-level credit assignment in LLM reasoning.
GRPO assigns a single trajectory-level advantage to every token.
Prior critic-free methods use oracle-conditioned likelihood ratios for per-token signals.
OPPO uses a Bayesian update of the model's belief about eventual success.
The method accumulates oracle signals along a trajectory.
It estimates success probability at every position in closed form.
OPPO requires one extra forward pass.
The approach combines local discrimination with trajectory-level evidence.
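The summary does not give OPPO's actual update rule, so the following is only an illustrative sketch of how oracle signals could be accumulated into a per-position success belief. It assumes the oracle-prompted distribution stands in for "token probability given eventual success", so that Bayes' rule multiplies the running belief by the ratio of oracle-prompted to plain token probability. The function names, the prior of 0.5, the clipping, and the use of belief changes as per-token credit are all assumptions, not the authors' formulation.

```python
import math

def success_beliefs(logp_policy, logp_oracle, prior=0.5, eps=1e-6):
    """Running estimate of P(success | tokens so far).

    logp_policy: log-prob of each sampled token under the plain prompt
    logp_oracle: log-prob of the same tokens under the oracle-augmented
                 prompt (obtainable with one extra forward pass)
    Assumed update: p_t = p_{t-1} * P(y_t | success) / P(y_t).
    """
    beliefs = []
    p = prior
    for lp, lo in zip(logp_policy, logp_oracle):
        ratio = math.exp(lo - lp)          # oracle-conditioned likelihood ratio
        p = min(max(p * ratio, eps), 1.0 - eps)  # clip: the proxy is inexact
        beliefs.append(p)
    return beliefs

def token_credits(beliefs, prior=0.5):
    """Hypothetical per-token signal: how much each token moved the belief."""
    previous = [prior] + beliefs[:-1]
    return [b - a for a, b in zip(previous, beliefs)]

# Toy three-token trajectory with made-up probabilities.
logp_policy = [math.log(0.5), math.log(0.5), math.log(0.2)]
logp_oracle = [math.log(0.6), math.log(0.25), math.log(0.3)]
beliefs = success_beliefs(logp_policy, logp_oracle)
credits = token_credits(beliefs)
```

In this toy run the belief moves from the prior of 0.5 to roughly 0.6, 0.3, and 0.45, so the second token receives negative credit and the other two positive credit. A single local likelihood ratio sees only one token at a time, whereas the accumulated belief carries evidence from the whole prefix.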
