ARTFEED — Contemporary Art Intelligence

Conditional Optimal Transport Calibrates Process Reward Models

ai-technology · 2026-05-11

A new method using conditional optimal transport (CondOT) improves calibration of Process Reward Models (PRMs) for inference-time scaling. The approach modifies CondOT map learning to estimate a monotonic conditional quantile function over success probabilities, conditioned on PRM hidden states, yielding structurally valid quantile estimates and confidence bounds at arbitrary levels. Integrated into the instance-adaptive scaling (IAS) framework, it is evaluated on MATH-500 and AIME benchmarks, showing substantial calibration improvements over uncalibrated PRMs and quantile regression when PRMs have reliable ranking signals.

Key facts

  • Announced as arXiv:2605.06785v1 (a cross-listed submission).
  • PRMs are often poorly calibrated and overestimate success probabilities.
  • First use of conditional optimal transport for calibrating PRMs.
  • Method modifies CondOT map learning from bunne2022supervised.
  • Estimates a monotonic conditional quantile function over success probabilities.
  • Conditioned on PRM hidden states.
  • Yields structurally valid quantile estimates and confidence bounds.
  • Integrated into IAS framework from park2025know.
  • Evaluated on MATH-500 and AIME benchmarks.
  • Substantially improves calibration over uncalibrated PRMs and a quantile-regression baseline, provided the PRM's ranking signal is reliable.
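The structural guarantee described above, a conditional quantile function that cannot cross itself, can be sketched in a few lines. This is a hypothetical illustration of the general non-crossing construction (positive increments accumulated over quantile levels), not the authors' CondOT implementation; all names and shapes here are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def conditional_quantiles(h, W, b):
    """Sketch of a monotonic conditional quantile map.

    h: PRM hidden state, shape (d,).
    W, b: linear map from hidden state to K raw increments,
          one per quantile level (hypothetical parameterization).

    Softplus makes every increment strictly positive; the cumulative
    sum over the level index then guarantees Q(tau_i) < Q(tau_j) for
    i < j -- quantile estimates are non-crossing by construction.
    """
    raw = W @ h + b                    # (K,) raw increments
    inc = np.log1p(np.exp(raw))        # softplus -> strictly positive
    q = np.cumsum(inc)                 # monotone in the level index
    return 1.0 / (1.0 + np.exp(-q))    # squash into (0, 1): success probs

d, K = 8, 5                            # hidden dim, number of quantile levels
W = rng.normal(size=(K, d)) * 0.1
b = rng.normal(size=K)
h = rng.normal(size=d)                 # stand-in for a PRM hidden state

q = conditional_quantiles(h, W, b)
assert np.all(np.diff(q) > 0)          # structurally valid: no crossing
```

Because monotonicity holds for any parameter values, confidence bounds read off this map at arbitrary levels are always ordered correctly, which is the "structurally valid" property the abstract highlights; ordinary quantile regression offers no such guarantee.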

Entities

Institutions

  • arXiv

Sources