ARTFEED — Contemporary Art Intelligence

Controllable Synthesis of Process Supervision Data for Reward Models

other · 2026-05-06

A new framework synthesizes process supervision data for training process reward models (PRMs). It builds a valid symbolic reasoning chain, injects a template-aware error at a chosen intermediate step, recomputes the subsequent steps from the corrupted state, and verifies that the injected step cannot be derived from the steps before it. The resulting paired trajectories are prefix-invalid at the first error yet remain internally consistent after symbolic recomputation. They are then rendered as aligned natural-language processes for training and evaluating PRMs. According to the reported experiments, the synthesized data improve Best-of-8 reranking on logical reasoning benchmarks and transfer to mathematical reasoning, and step-level evaluations further support the method.
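
As an illustration only, the four steps can be sketched on a toy arithmetic chain. Everything here is an assumption made for the example rather than the framework's actual implementation: the operation set, the `ERROR_TEMPLATES` table, and the function names `run_chain`, `corrupt`, and `verbalize` are hypothetical, and the derivability check is reduced to comparing the injected value with the correct one.

```python
import operator

# Hypothetical toy domain: a chain of integer operations.
OPS = {"add": operator.add, "sub": operator.sub, "mul": operator.mul}

# Hypothetical error templates: each operation maps to a plausible slip.
ERROR_TEMPLATES = {"add": "sub", "sub": "add", "mul": "add"}


def run_chain(start, program):
    """Apply each (op, arg) in order and return the state after every step."""
    states, value = [], start
    for op, arg in program:
        value = OPS[op](value, arg)
        states.append(value)
    return states


def corrupt(start, program, k):
    """Inject a template error at step k and recompute the later steps.

    Returns the corrupted list of states, or None when the injected value
    equals the correct one (it would then be derivable from the prefix).
    """
    correct = run_chain(start, program)
    prev = start if k == 0 else correct[k - 1]
    op, arg = program[k]
    wrong = OPS[ERROR_TEMPLATES[op]](prev, arg)
    if wrong == correct[k]:
        return None  # verification failed: not a real error
    # Later steps are recomputed from the corrupted state, so each one is
    # consistent with its predecessor even though the prefix is invalid.
    tail = run_chain(wrong, program[k + 1:])
    return correct[:k] + [wrong] + tail


def verbalize(program, states):
    """Render a trajectory as aligned natural-language steps."""
    return [
        f"Step {i}: {op} {arg} gives {value}"
        for i, ((op, arg), value) in enumerate(zip(program, states), start=1)
    ]


PROGRAM = [("add", 3), ("mul", 4), ("sub", 1)]
clean = run_chain(2, PROGRAM)      # correct trajectory
noisy = corrupt(2, PROGRAM, 1)     # first error at step index 1
for line in verbalize(PROGRAM, noisy):
    print(line)
```

In this sketch a candidate error is discarded when it happens to reproduce the correct value (for example, swapping `mul 2` for `add 2` when the current state is 2), which is the toy counterpart of the non-derivability check.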

Key facts

  • Framework constructs correct symbolic reasoning chains
  • Injects template-aware errors into intermediate steps
  • Recomputes subsequent steps under corrupted state
  • Verifies injected step is not derivable from its prefix
  • Paired trajectories are prefix-invalid at first error
  • Trajectories remain consistent after symbolic recomputation
  • Data improves Best-of-8 reranking on logical reasoning
  • Transfers to mathematical reasoning tasks
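
Best-of-8 reranking, as mentioned in the list above, is commonly done by scoring every step of each candidate solution with the PRM and keeping the candidate with the best aggregate score. The sketch below is a minimal version under stated assumptions: it aggregates step scores with `min`, and it uses a hypothetical `toy_scorer` in place of a trained PRM. The source does not say how step scores are combined.

```python
from typing import Callable, Iterable, Sequence


def rerank_best_of_n(
    candidates: Sequence[Sequence[str]],
    step_scorer: Callable[[Sequence[str]], float],
    aggregate: Callable[[Iterable[float]], float] = min,
) -> int:
    """Return the index of the candidate with the highest aggregated score.

    step_scorer receives the prefix ending at the step being judged and
    returns a score for that last step. Candidates are assumed non-empty.
    """
    def trajectory_score(steps: Sequence[str]) -> float:
        return aggregate(step_scorer(steps[: i + 1]) for i in range(len(steps)))

    return max(range(len(candidates)), key=lambda i: trajectory_score(candidates[i]))


def toy_scorer(prefix: Sequence[str]) -> float:
    """Hypothetical stand-in for a PRM: penalizes steps marked WRONG."""
    return 0.1 if "WRONG" in prefix[-1] else 0.9


# Eight sampled candidates; only the one at index 5 has no flawed step.
candidates = (
    [["ok", "WRONG"]] * 5
    + [["ok", "ok"]]
    + [["WRONG", "ok"]] * 2
)
best = rerank_best_of_n(candidates, toy_scorer)
print(best)
```

With `min` as the aggregate, a single low-scoring step is enough to sink a candidate, which matches the idea that a trajectory is invalid from its first error onward.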

Entities

Institutions

  • arXiv
