BetaPRM: Distributional Process Reward Model for Reliable Step-Level Feedback

other · 2026-05-18

A recent arXiv paper introduces BetaPRM, a distributional Process Reward Model (PRM) that predicts both the probability that a reasoning step succeeds and how reliable that prediction is. Existing PRMs output a single reward score per step, and downstream methods often treat that score as accurate even when it is not. BetaPRM instead uses a Beta-Binomial likelihood to derive a Beta belief about each step from Monte Carlo continuations, and that belief yields a reliability signal indicating when a step reward can be trusted. Downstream applications such as Adaptive Computation Allocation can then distinguish reliable step rewards from uncertain ones.
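
To make the Beta-Binomial idea concrete, here is a minimal sketch of the standard Beta-Binomial negative log-likelihood, assuming a model that outputs two positive parameters per step and supervision that counts how many Monte Carlo continuations from that step succeed. The function names, the (alpha, beta) parameterization, and the use of variance as a reliability proxy are illustrative assumptions, not details confirmed by the paper.

```python
import math


def log_beta_fn(a: float, b: float) -> float:
    """Log of the Beta function B(a, b), computed with log-gamma for stability."""
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)


def beta_binomial_nll(alpha: float, beta: float, successes: int, trials: int) -> float:
    """Negative log-likelihood of `successes` out of `trials` continuations
    under a Beta(alpha, beta) belief about the step's success probability.
    (Hypothetical training loss; the paper's exact objective may differ.)"""
    log_choose = (
        math.lgamma(trials + 1)
        - math.lgamma(successes + 1)
        - math.lgamma(trials - successes + 1)
    )
    log_lik = (
        log_choose
        + log_beta_fn(alpha + successes, beta + trials - successes)
        - log_beta_fn(alpha, beta)
    )
    return -log_lik


def beta_summary(alpha: float, beta: float) -> tuple[float, float]:
    """Mean and variance of Beta(alpha, beta).
    The mean plays the role of the step reward; a smaller variance is used
    here as an assumed stand-in for higher reliability."""
    total = alpha + beta
    mean = alpha / total
    variance = alpha * beta / (total * total * (total + 1.0))
    return mean, variance
```

Under this sketch, Beta(2, 2) and Beta(20, 20) both have mean 0.5, but the second has a much smaller variance, which is the kind of distinction a single scalar reward cannot express. As a sanity check, with alpha = beta = 1 the Beta-Binomial is uniform over the possible success counts, so any count out of 8 continuations has probability 1/9.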

Key facts

BetaPRM is a distributional Process Reward Model.
It predicts step-level success probability and prediction reliability.
Current PRMs output only a single reward score per step.
BetaPRM uses a Beta-Binomial likelihood from Monte Carlo continuations.
The reliability signal indicates when a step reward should be trusted.
One application is Adaptive Computation Allocation.
The paper is on arXiv with ID 2605.15529.
The arXiv announcement type is cross (a cross-listing).
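
The reliability signal and the Adaptive Computation Allocation use case listed above can be illustrated with a small sketch. This summary does not describe the paper's actual allocation rule, so the gating criterion (standard deviation of the Beta belief), the threshold `max_std`, and the rollout budgets below are all hypothetical choices made for illustration.

```python
import math


def beta_std(alpha: float, beta: float) -> float:
    """Standard deviation of Beta(alpha, beta), used here as an
    assumed measure of how uncertain a step reward is."""
    total = alpha + beta
    return math.sqrt(alpha * beta / (total * total * (total + 1.0)))


def allocate_rollouts(
    beliefs: list[tuple[float, float]],
    base: int = 0,
    extra: int = 8,
    max_std: float = 0.15,
) -> list[int]:
    """For each step's (alpha, beta) belief, spend `extra` additional
    continuations only when the belief is too uncertain to trust.
    (Hypothetical rule; thresholds and budgets are illustrative.)"""
    plan = []
    for alpha, beta in beliefs:
        uncertain = beta_std(alpha, beta) > max_std
        plan.append(extra if uncertain else base)
    return plan
```

For example, `allocate_rollouts([(2, 2), (20, 20), (9, 1)])` spends extra rollouts only on the first step, whose belief is diffuse, and trusts the predicted reward for the other two.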
