Controllable Synthesis of Process Supervision Data for Reward Models

other · 2026-05-06

A new approach is introduced for synthesizing process supervision data for process reward models (PRMs). The method constructs a correct symbolic reasoning chain, injects a template-aware error at a chosen intermediate step, recomputes the subsequent steps under the corrupted state, and verifies that the injected step is not derivable from its prefix. The resulting paired trajectories are prefix-invalid at the first error yet remain internally consistent after symbolic recomputation. They are then rendered into aligned natural-language processes for training and evaluating PRMs. Experiments indicate that the synthesized data improve Best-of-8 reranking on logical reasoning tasks and transfer to mathematical reasoning; step-level evaluations further support the method.
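The four stages described above (construct, inject, recompute, verify) can be illustrated with a minimal sketch. It assumes a toy arithmetic domain in place of the symbolic reasoning chains used in the work, and treats "template-aware" as swapping in a different operator from the same operator set. The names `run_chain` and `inject_error` are hypothetical and not taken from the source.

```python
import operator

# Toy operator set standing in for the reasoning templates (an assumption).
OPS = {"+": operator.add, "-": operator.sub, "*": operator.mul}

def run_chain(start, steps):
    """Return the intermediate values of a chain of (op, operand) steps."""
    values, state = [], start
    for op, operand in steps:
        state = OPS[op](state, operand)
        values.append(state)
    return values

def inject_error(start, steps, k, wrong_op):
    """Corrupt step k with wrong_op, then recompute the later steps.

    Returns the corrupted trajectory, or None when the injected value
    coincides with the value the prefix actually entails (in which case
    the step would still be derivable and the sample is rejected).
    """
    correct = run_chain(start, steps)
    prev = start if k == 0 else correct[k - 1]
    _, operand = steps[k]
    bad_value = OPS[wrong_op](prev, operand)
    if bad_value == correct[k]:
        return None  # not a real error: derivable from the prefix
    # Later steps are recomputed from the corrupted state, so the
    # trajectory stays internally consistent after the first error.
    tail = run_chain(bad_value, steps[k + 1:])
    return correct[:k] + [bad_value] + tail

steps = [("+", 4), ("*", 2), ("-", 5)]
correct = run_chain(3, steps)               # [7, 14, 9]
corrupted = inject_error(3, steps, 1, "+")  # [7, 9, 4]
rejected = inject_error(2, [("+", 2)], 0, "*")  # None, since 2 * 2 == 2 + 2
```

In this sketch the pair (`correct`, `corrupted`) shares the prefix before step 1, diverges at the injected step, and the remaining step follows validly from the corrupted value. Rendering the pair into natural language is omitted.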

Key facts

Framework constructs correct symbolic reasoning chains
Injects template-aware errors into intermediate steps
Recomputes subsequent steps under corrupted state
Verifies injected step is not derivable from its prefix
Paired trajectories are prefix-invalid at first error
Trajectories remain consistent after symbolic recomputation
Data improves Best-of-8 reranking on logical reasoning
Transfers to mathematical reasoning tasks
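For readers unfamiliar with Best-of-N reranking, the sketch below shows one common way a PRM can be used to pick among sampled solutions. The source does not state how step scores are aggregated, so taking the minimum step score is an assumption here, and `toy_prm` is a hypothetical stand-in for a trained reward model.

```python
def rerank_best_of_n(candidates, score_steps, aggregate=min):
    """Return the candidate whose aggregated step scores are highest.

    candidates: list of solutions, each a non-empty list of step strings.
    score_steps: callable mapping a solution to one score per step.
    aggregate: how step scores combine into a solution score
               (min is an assumption, not the source's stated choice).
    """
    return max(candidates, key=lambda steps: aggregate(score_steps(steps)))

def toy_prm(steps):
    """Hypothetical scorer: penalize any step flagged with 'ERR'."""
    return [0.1 if "ERR" in step else 0.9 for step in steps]

# Three candidates shown for brevity; Best-of-8 would sample eight.
candidates = [
    ["a implies b", "ERR: b implies a", "so a"],
    ["a implies b", "a holds", "so b"],
    ["ERR: assume b", "so b"],
]
best = rerank_best_of_n(candidates, toy_prm)
```

With the minimum as the aggregate, a single low-scoring step is enough to demote a whole solution, which matches the intuition that a trajectory is invalid from its first error onward.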
