Conditional Optimal Transport Calibrates Process Reward Models
A new method using conditional optimal transport (CondOT) improves the calibration of Process Reward Models (PRMs) for inference-time scaling. The approach modifies CondOT map learning to estimate a monotonic conditional quantile function over success probabilities, conditioned on PRM hidden states, yielding structurally valid quantile estimates and confidence bounds at arbitrary levels. Integrated into the instance-adaptive scaling (IAS) framework, it is evaluated on the MATH-500 and AIME benchmarks, where it substantially improves calibration over both uncalibrated PRMs and quantile regression, provided the PRM's ranking signal is reliable.
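The core object here is a conditional quantile function that is monotone in the quantile level by construction, so that estimates at different levels can never cross. The paper's CondOT-based map is not reproduced here; the sketch below is only a generic illustration of how such structural monotonicity can be enforced, assuming a linear read-out from the PRM hidden state (the names `W_base`, `W_inc`, and the sigmoid squashing are illustrative assumptions, not the paper's architecture):

```python
import numpy as np

def softplus(x):
    # Numerically stable softplus; guarantees strictly positive increments.
    return np.log1p(np.exp(-np.abs(x))) + np.maximum(x, 0.0)

def conditional_quantiles(hidden, W_base, W_inc, taus):
    """Map a PRM hidden state to monotone quantile estimates of the
    success probability at the requested quantile levels `taus`.

    hidden : (d,)   PRM hidden-state vector (assumed input)
    W_base : (d,)   weights producing the lowest quantile on the logit scale
    W_inc  : (k, d) weights producing k nonnegative quantile increments
    taus   : (m,)   arbitrary quantile levels in [0, 1]
    """
    base = hidden @ W_base                   # lowest-level quantile (logit scale)
    increments = softplus(W_inc @ hidden)    # positive steps -> no quantile crossing
    logits = base + np.concatenate(([0.0], np.cumsum(increments)))
    q = 1.0 / (1.0 + np.exp(-logits))        # squash into (0, 1): valid probabilities
    # Interpolate the k+1 grid quantiles to the requested levels.
    return np.interp(taus, np.linspace(0.0, 1.0, len(q)), q)
```

Because the grid quantiles are a cumulative sum of positive increments passed through a monotone squashing function, the output is nondecreasing in the level, which is what makes confidence bounds at arbitrary levels (e.g. evaluating at τ and 1 − τ) structurally valid.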
Key facts
- Listed as arXiv:2605.06785v1 (a cross-listed abstract).
- PRMs are often poorly calibrated and overestimate success probabilities.
- First use of conditional optimal transport for calibrating PRMs.
- Method modifies CondOT map learning from bunne2022supervised.
- Estimates a monotonic conditional quantile function over success probabilities.
- Conditioned on PRM hidden states.
- Yields structurally valid quantile estimates and confidence bounds.
- Integrated into IAS framework from park2025know.
- Evaluated on MATH-500 and AIME benchmarks.
- Substantially improves calibration over uncalibrated PRMs and quantile regression.
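The facts above note that the calibrated quantile estimates feed into instance-adaptive scaling (IAS), which sizes the per-problem sample budget from an estimated success probability. As a hedged sketch only, assuming IAS uses the standard best-of-n budget rule with a calibrated lower confidence bound `p_lb` on per-sample success (the function name, `target`, and `n_max` cap are illustrative assumptions, not the rule from park2025know):

```python
import math

def ias_sample_budget(p_lb, target=0.95, n_max=64):
    """Smallest n such that at least one of n independent samples succeeds
    with probability >= target, given per-sample success probability p_lb.
    Solves 1 - (1 - p_lb)**n >= target for n."""
    if p_lb <= 0.0:
        return n_max                  # no usable signal: fall back to the cap
    if p_lb >= 1.0:
        return 1                      # certain success: one sample suffices
    n = math.ceil(math.log(1.0 - target) / math.log(1.0 - p_lb))
    return max(1, min(n, n_max))
```

Using a calibrated lower bound rather than the raw PRM score matters here: an overconfident probability estimate makes the budget too small, which is exactly the failure mode the calibration method targets.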
Entities
Institutions
- arXiv