Controllable Synthesis of Process Supervision Data for Reward Models
This work introduces a method for synthesizing process supervision data for process reward models (PRMs). The method constructs a correct symbolic reasoning chain, injects a template-aware error at a chosen intermediate step, recomputes the subsequent steps under the corrupted state, and verifies that the injected step is not derivable from its prefix. The resulting paired trajectories are prefix-invalid at the first error yet remain consistent after symbolic recomputation. They are then rendered as aligned natural-language processes for training and evaluating PRMs. Experiments indicate that the synthesized data improves Best-of-8 reranking on logical reasoning benchmarks and transfers to mathematical reasoning; step-level evaluations further support the method.
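The four stages described above (construct, inject, recompute, verify) can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes a toy propositional setting with Horn rules and forward chaining, and every name in it (`Step`, `closure`, `forward_chain`, `corrupt`, `render`) is hypothetical. "Template-aware" is approximated here by keeping the premises of the original step and swapping only its conclusion, and "not derivable from its prefix" is checked against the full forward-chaining closure of the prefix state, which is a conservative reading.

```python
from dataclasses import dataclass

# A Horn rule: if every premise is known, the conclusion may be derived.
Rule = tuple[frozenset[str], str]


@dataclass(frozen=True)
class Step:
    premises: frozenset[str]
    conclusion: str


def closure(facts: set[str], rules: list[Rule]) -> set[str]:
    """Return every atom derivable from `facts` by forward chaining."""
    known = set(facts)
    changed = True
    while changed:
        changed = False
        for premises, conclusion in rules:
            if premises <= known and conclusion not in known:
                known.add(conclusion)
                changed = True
    return known


def forward_chain(facts: set[str], rules: list[Rule], max_steps: int) -> list[Step]:
    """Derive up to `max_steps` new atoms, recording one Step per rule application."""
    known = set(facts)
    steps: list[Step] = []
    while len(steps) < max_steps:
        for premises, conclusion in rules:
            if premises <= known and conclusion not in known:
                steps.append(Step(premises, conclusion))
                known.add(conclusion)
                break
        else:
            break  # no rule fired, so nothing more can be derived
    return steps


def corrupt(facts: set[str], rules: list[Rule], chain: list[Step],
            index: int, wrong_atom: str) -> list[Step]:
    """Inject an error at `index`, then recompute the tail under the corrupted state."""
    prefix = chain[:index]
    state = set(facts) | {s.conclusion for s in prefix}
    # Verification: the injected conclusion must not be derivable from the prefix.
    if wrong_atom in closure(state, rules):
        raise ValueError(f"{wrong_atom!r} is derivable from the prefix")
    bad = Step(chain[index].premises, wrong_atom)  # same premises, wrong conclusion
    tail = forward_chain(state | {wrong_atom}, rules, len(chain) - index - 1)
    return prefix + [bad] + tail


def render(step: Step) -> str:
    """Hypothetical natural-language template for one step."""
    return f"Since {' and '.join(sorted(step.premises))}, we conclude {step.conclusion}."


facts = {"a", "b"}
rules: list[Rule] = [
    (frozenset({"a", "b"}), "c"),
    (frozenset({"x"}), "y"),
    (frozenset({"c"}), "d"),
    (frozenset({"d"}), "e"),
]
chain = forward_chain(facts, rules, max_steps=3)
corrupted = corrupt(facts, rules, chain, index=1, wrong_atom="x")
for good, bad in zip(chain, corrupted):
    print(render(good), "|", render(bad))
```

In this toy run the two trajectories share the first step, diverge at index 1, and the final corrupted step (deriving `y` from `x`) is locally valid given the corrupted state, which is the sense in which the tail stays consistent while the trajectory is invalid from the first error onward.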
Key facts
- Framework constructs correct symbolic reasoning chains
- Injects template-aware errors into intermediate steps
- Recomputes subsequent steps under corrupted state
- Verifies injected step is not derivable from its prefix
- Paired trajectories are prefix-invalid at first error
- Trajectories remain consistent after symbolic recomputation
- Data improves Best-of-8 reranking on logical reasoning
- Transfers to mathematical reasoning tasks
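For readers unfamiliar with Best-of-N reranking, the following sketch shows the general idea: a PRM scores each step of every candidate solution, the step scores are aggregated into one trajectory score, and the highest-scoring candidate is selected. The source does not state which aggregation is used, so the `min` and `prod` options below are assumptions, and `best_of_n`, `aggregate`, and the lookup-table scorer are hypothetical stand-ins for a trained PRM. Three candidates are used for brevity; Best-of-8 would pass eight.

```python
import math
from typing import Callable, Sequence


def aggregate(step_scores: Sequence[float], how: str = "min") -> float:
    """Collapse per-step scores into a single trajectory score."""
    if not step_scores:
        raise ValueError("a trajectory needs at least one step score")
    if how == "min":
        return min(step_scores)  # a trajectory is only as good as its weakest step
    if how == "prod":
        return math.prod(step_scores)  # treats step scores as independent probabilities
    raise ValueError(f"unknown aggregation: {how!r}")


def best_of_n(candidates: Sequence[str],
              score_steps: Callable[[str], Sequence[float]],
              how: str = "min") -> str:
    """Return the candidate whose aggregated step scores are highest."""
    return max(candidates, key=lambda c: aggregate(score_steps(c), how))


# Hypothetical per-step scores standing in for PRM outputs.
toy_scores = {
    "A": [0.90, 0.20, 0.95],
    "B": [0.80, 0.70, 0.75],
    "C": [0.99, 0.60, 0.99],
}
pick_min = best_of_n(list(toy_scores), toy_scores.__getitem__, how="min")
pick_prod = best_of_n(list(toy_scores), toy_scores.__getitem__, how="prod")
print(pick_min, pick_prod)
```

The two aggregations can disagree, as they do on this toy data: `min` penalizes a single weak step most heavily, while `prod` rewards trajectories whose steps are strong on balance.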
Entities
Institutions
- arXiv