Counterfactual Reasoning Reduces Credit Assignment Variance in LLM Reinforcement Learning
A new framework addresses the credit assignment problem in reinforcement learning for multi-step reasoning with large language models. When the only reward arrives at the end of a trajectory, it is hard to tell which steps earned it, and the result is high gradient variance and unstable training. The proposed method samples multiple reasoning trajectories for the same input and treats the differences among them as an implicit approximation of alternative decisions, from which it constructs a step-sensitive advantage estimator. This turns sparse terminal rewards into process-level learning signals. The resulting algorithm, Implicit Behavior Policy Optimization (IBPO), improves training stability and raises the performance ceiling on mathematical reasoning tasks.
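To make the idea of "differences between trajectories as alternative decisions" concrete, here is a minimal sketch. It is not the IBPO estimator, whose exact form is not given in this summary. It assumes that each trajectory is a list of discrete step strings, that the value of a prefix can be estimated as the mean terminal reward of all sampled trajectories sharing that prefix, and that a step's advantage is the change in that value when the step is taken. The function name `step_advantages` and the toy data are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple

Prefix = Tuple[str, ...]


def step_advantages(
    trajectories: Sequence[Sequence[str]],
    rewards: Sequence[float],
) -> List[List[float]]:
    """Assign a per-step advantage to each trajectory sampled for ONE input.

    Assumption (illustrative, not from the paper): the value of a prefix is
    the mean terminal reward over all sampled trajectories that start with it.
    """
    if len(trajectories) != len(rewards):
        raise ValueError("need exactly one terminal reward per trajectory")

    totals: Dict[Prefix, float] = defaultdict(float)
    counts: Dict[Prefix, int] = defaultdict(int)

    # Accumulate the terminal reward on every prefix of every trajectory,
    # including the empty prefix, whose mean is the group baseline.
    for steps, reward in zip(trajectories, rewards):
        for t in range(len(steps) + 1):
            prefix = tuple(steps[:t])
            totals[prefix] += reward
            counts[prefix] += 1

    def value(prefix: Prefix) -> float:
        return totals[prefix] / counts[prefix]

    # Advantage of step t = value after taking the step - value before it.
    return [
        [
            value(tuple(steps[: t + 1])) - value(tuple(steps[:t]))
            for t in range(len(steps))
        ]
        for steps in trajectories
    ]


# Toy group of four trajectories for the same input; only the first succeeds.
group = [["a", "b"], ["a", "c"], ["d", "e"], ["d", "f"]]
terminal_rewards = [1.0, 0.0, 0.0, 0.0]

for steps, adv in zip(group, step_advantages(group, terminal_rewards)):
    print(steps, adv)
# ['a', 'b'] [0.25, 0.5]
# ['a', 'c'] [0.25, -0.5]
# ['d', 'e'] [-0.25, 0.0]
# ['d', 'f'] [-0.25, 0.0]
```

In this toy group, step "a" earns positive credit because trajectories passing through it do better than the group average, and the choice between "b" and "c" receives the largest signal because that is where the outcomes diverge. When a trajectory's full step sequence is unique in the group, its step advantages sum to its terminal reward minus the group mean, so the sketch redistributes a group-relative signal across steps rather than inventing extra reward.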
Key facts
- Reinforcement learning for multi-step reasoning with LLMs relies on sparse terminal rewards.
- Sparse terminal rewards lead to poor credit assignment and high gradient variance.
- The framework samples multiple reasoning trajectories for the same input.
- Differences between trajectories approximate alternative decisions.
- An implicit process-level advantage estimator is constructed.
- The algorithm is called Implicit Behavior Policy Optimization (IBPO).
- IBPO improves training stability and raises the performance ceiling on mathematical reasoning tasks.
- The work is published on arXiv with ID 2605.16302.
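The link between sparse terminal rewards and gradient variance in the key facts above can be illustrated with the standard policy-gradient identities. The notation here is assumed for illustration and is not taken from the paper: $\tau$ is a sampled trajectory with states $s_t$ and steps $a_t$, $R(\tau)$ is its terminal reward, $b(s_t)$ is any baseline that does not depend on $a_t$, and $\hat{A}_t$ stands for a generic step-level advantage estimate.

```latex
\begin{align}
% Terminal reward only: every step is weighted by the same scalar R(\tau).
\nabla_\theta J(\theta)
  &= \mathbb{E}_{\tau \sim \pi_\theta}\!\left[
       R(\tau) \sum_{t} \nabla_\theta \log \pi_\theta(a_t \mid s_t)
     \right] \\
% Subtracting a baseline leaves the expectation unchanged, because
% \mathbb{E}_{a_t \sim \pi_\theta}[\nabla_\theta \log \pi_\theta(a_t \mid s_t)] = 0.
  &= \mathbb{E}_{\tau \sim \pi_\theta}\!\left[
       \sum_{t} \bigl(R(\tau) - b(s_t)\bigr)\,
       \nabla_\theta \log \pi_\theta(a_t \mid s_t)
     \right] \\
% A step-sensitive estimator uses a different weight at each step.
\widehat{\nabla_\theta J}
  &= \sum_{t} \hat{A}_t \,\nabla_\theta \log \pi_\theta(a_t \mid s_t)
\end{align}
```

In the first line a good step and a bad step in the same trajectory receive identical credit, which is the source of the variance described above. A process-level estimator of the kind the summary describes corresponds to the third line, with $\hat{A}_t$ built from comparisons among trajectories sampled for the same input.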
Entities
Institutions
- arXiv