BetaPRM: Distributional Process Reward Model for Reliable Step-Level Feedback
A recent arXiv paper introduces BetaPRM, a distributional Process Reward Model (PRM) that predicts both the probability of success at each reasoning step and the reliability of that prediction. Existing PRMs produce a single reward score per step, which downstream methods often treat as accurate even when it is not. BetaPRM instead learns a Beta belief over each step's success probability, using a Beta-Binomial likelihood fitted to the outcomes of Monte Carlo continuations. The resulting reliability signal indicates when a step reward can be trusted, so applications such as Adaptive Computation Allocation can distinguish reliable rewards from uncertain ones.
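To make the Beta-Binomial idea concrete, here is a minimal sketch of the standard Beta-Binomial log-likelihood in Python. It assumes that a step is followed by n Monte Carlo continuations, of which k reach a correct answer, and that the model outputs two positive parameters alpha and beta for that step. The function names, and the use of the Beta variance as a reliability proxy, are illustrative assumptions and are not taken from the paper.

```python
import math


def log_beta_fn(a: float, b: float) -> float:
    """Log of the Beta function B(a, b), computed via log-gamma."""
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)


def beta_binomial_log_likelihood(k: int, n: int, alpha: float, beta: float) -> float:
    """Log-probability of k successes in n continuations under Beta(alpha, beta).

    Assumption: k counts Monte Carlo continuations that ended correctly.
    """
    log_choose = math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
    return log_choose + log_beta_fn(k + alpha, n - k + beta) - log_beta_fn(alpha, beta)


def beta_mean(alpha: float, beta: float) -> float:
    """Mean of the Beta belief: the predicted step success probability."""
    return alpha / (alpha + beta)


def beta_variance(alpha: float, beta: float) -> float:
    """Variance of the Beta belief (hypothetical reliability proxy: lower is more reliable)."""
    total = alpha + beta
    return alpha * beta / (total * total * (total + 1.0))


if __name__ == "__main__":
    # Two beliefs with the same mean but different concentration.
    for a, b in [(2.0, 2.0), (20.0, 20.0)]:
        nll = -beta_binomial_log_likelihood(k=4, n=8, alpha=a, beta=b)
        print(f"alpha={a}, beta={b}: mean={beta_mean(a, b):.2f}, "
              f"var={beta_variance(a, b):.4f}, nll={nll:.3f}")
```

Under these assumptions, a training loss would be the negative of this log-likelihood averaged over steps. Two beliefs can share the same mean while differing in variance, which is the extra information a single scalar reward cannot carry.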
Key facts
- BetaPRM is a distributional Process Reward Model.
- It predicts step-level success probability and prediction reliability.
- Current PRMs output only a single reward score per step.
- BetaPRM learns a Beta belief with a Beta-Binomial likelihood over Monte Carlo continuations.
- The reliability signal indicates when a step reward should be trusted.
- One application is Adaptive Computation Allocation.
- The paper is on arXiv with ID 2605.15529.
- The arXiv announcement type is cross (a cross-listing).
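As an illustration of how a reliability signal could drive Adaptive Computation Allocation, the sketch below spends an extra sampling budget only on steps whose Beta belief is uncertain. The variance threshold, the proportional split, and the function name allocate_extra_samples are hypothetical choices made for this example. The source does not describe the paper's actual allocation rule.

```python
def allocate_extra_samples(beliefs, budget, variance_threshold=0.02):
    """Split an extra continuation budget across uncertain steps.

    beliefs: list of (alpha, beta) pairs, one Beta belief per step.
    budget: total number of extra continuations available.
    Returns a list with the number of extra continuations per step.
    Hypothetical rule: steps at or below the variance threshold get nothing;
    the rest share the budget in proportion to their variance (rounded down).
    """
    variances = [
        a * b / ((a + b) ** 2 * (a + b + 1.0)) for a, b in beliefs
    ]
    uncertain = [i for i, v in enumerate(variances) if v > variance_threshold]
    extra = [0] * len(beliefs)
    if not uncertain:
        return extra
    total_variance = sum(variances[i] for i in uncertain)
    for i in uncertain:
        extra[i] = int(budget * variances[i] / total_variance)
    return extra


if __name__ == "__main__":
    # One confident step followed by two uncertain ones.
    step_beliefs = [(20.0, 2.0), (2.0, 2.0), (1.0, 1.0)]
    print(allocate_extra_samples(step_beliefs, budget=16))
```

The point of the sketch is the contrast with a scalar reward: with only a single score per step, there is no principled way to decide which steps deserve more computation.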
Entities
Institutions
- arXiv