Survey of Process Reward Models for LLM Reasoning Alignment
A recent paper published on arXiv provides a comprehensive overview of Process Reward Models (PRMs), which evaluate and guide the reasoning of large language models at the step or trajectory level, in contrast to outcome reward models that judge only the final result. The survey covers the full pipeline: generating process data, constructing PRMs, and applying them for test-time scaling and reinforcement learning. It examines applications in mathematics, coding, text, multimodal reasoning, robotics, and agents, and discusses emerging benchmarks. The authors aim to clarify the design space, highlight open challenges, and steer future research toward fine-grained and robust reasoning alignment.
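To make the step-level versus outcome-level distinction concrete, here is a minimal sketch in Python. The function names (orm_score, prm_score_steps, aggregate) and the placeholder scoring rules are illustrative assumptions, not APIs or methods from the survey; a real ORM or PRM would be a trained model.

```python
# Illustrative sketch only: hypothetical functions contrasting an outcome
# reward model (one score for the final answer) with a process reward model
# (one score per intermediate reasoning step).
from typing import List


def orm_score(final_answer: str) -> float:
    """Outcome reward model: a single scalar for the final result only."""
    # Placeholder rule; a real ORM would score the answer with a learned model.
    return 1.0 if final_answer.strip() == "42" else 0.0


def prm_score_steps(steps: List[str]) -> List[float]:
    """Process reward model: one score per reasoning step in the trajectory."""
    # Placeholder heuristic; a real PRM scores each step with a learned model.
    return [0.0 if "error" in step.lower() else 1.0 for step in steps]


def aggregate(step_scores: List[float], how: str = "min") -> float:
    """Collapse step scores into a trajectory-level score (min or mean)."""
    if how == "min":
        return min(step_scores)
    return sum(step_scores) / len(step_scores)


if __name__ == "__main__":
    steps = ["Let x be the unknown.", "Solve 2x + 2 = 86, so x = 42.", "Answer: 42"]
    print("ORM:", orm_score("42"))
    print("PRM per step:", prm_score_steps(steps))
    print("PRM trajectory:", aggregate(prm_score_steps(steps)))
```

The point of the contrast is that the PRM exposes where a trajectory goes wrong, whereas the ORM only signals whether the end result happened to be correct.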
Key facts
- PRMs address the gap left by outcome reward models (ORMs) by evaluating reasoning at the step or trajectory level.
- The survey provides a systematic overview of PRMs across data generation, model building, and use for test-time scaling and reinforcement learning (a best-of-N selection sketch follows this list).
- Applications include math, code, text, multimodal reasoning, robotics, and agents.
- The paper reviews emerging benchmarks for PRMs.
- The goal is to clarify design spaces and guide future research.
- The paper is available on arXiv under Computer Science > Computation and Language.
- The survey emphasizes fine-grained, robust reasoning alignment.
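As a companion to the test-time scaling item above, the sketch below shows PRM-guided best-of-N selection, a common way a PRM is used at inference time. The helper names (generate_candidates, prm_score_steps, best_of_n) and the random scores are hypothetical stand-ins under assumed behavior, not the paper's implementation.

```python
# Minimal sketch of PRM-guided best-of-N selection for test-time scaling:
# sample several reasoning trajectories, score every step with a PRM, and
# keep the trajectory whose weakest step scores highest.
import random
from typing import List, Tuple


def generate_candidates(question: str, n: int) -> List[List[str]]:
    """Stand-in for sampling n reasoning trajectories from an LLM."""
    return [[f"step {i + 1} for candidate {k}" for i in range(3)] for k in range(n)]


def prm_score_steps(steps: List[str]) -> List[float]:
    """Stand-in for a trained PRM; returns one score per step."""
    return [random.random() for _ in steps]


def best_of_n(question: str, n: int = 8) -> Tuple[List[str], float]:
    """Rank candidates by their minimum step score and return the best one."""
    candidates = generate_candidates(question, n)
    scored = [(min(prm_score_steps(c)), c) for c in candidates]
    score, best = max(scored, key=lambda pair: pair[0])
    return best, score


if __name__ == "__main__":
    trajectory, score = best_of_n("example question")
    print("selected trajectory:", trajectory)
    print("min step score:", round(score, 3))
```

Using the minimum step score as the ranking criterion is one common aggregation choice; mean or product aggregation are alternatives with different tolerance for a single weak step.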
Entities
Institutions
- arXiv