Verifiable Process Supervision Improves Language Model Reasoning
A team of researchers has introduced verifiable process supervision (VPS), a post-training framework that jointly optimizes prediction accuracy and reasoning quality in language models. Whereas standard reinforcement learning with verifiable rewards optimizes only the final outcome and can degrade reasoning even as task accuracy improves, VPS first uses supervised fine-tuning to induce a structured reasoning format. That format allows intermediate claims to be extracted syntactically and verified against ground-truth signals, yielding process-level rewards. An adaptive reward weighting scheme then emphasizes the components with the largest remaining errors, creating an implicit curriculum. The method is evaluated on chess. The full paper is available on arXiv.
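To make the extract-and-verify step concrete, here is a minimal sketch of how process-level rewards could be computed from a structured trace. The summary does not specify the paper's reasoning format, claim schema, or verifier, so the tag syntax, the `extract_claims`/`process_reward` helpers, and the ground-truth lookup below are all illustrative assumptions.

```python
# Hypothetical sketch of VPS-style process rewards. The <claim> tag format,
# helper names, and ground-truth dictionary are assumptions for illustration,
# not the paper's actual implementation.
import re

CLAIM_PATTERN = re.compile(r"<claim>(.*?)</claim>", re.DOTALL)


def extract_claims(reasoning: str) -> list[str]:
    """Syntactically pull intermediate claims out of a structured trace."""
    return [c.strip() for c in CLAIM_PATTERN.findall(reasoning)]


def process_reward(reasoning: str, ground_truth: dict[str, bool]) -> float:
    """Fraction of extracted claims that a ground-truth verifier confirms."""
    claims = extract_claims(reasoning)
    if not claims:
        return 0.0
    verified = sum(1 for claim in claims if ground_truth.get(claim, False))
    return verified / len(claims)


trace = (
    "<claim>White's queen is attacked</claim> so the best reply is "
    "<claim>Qd1 retreats the queen to safety</claim>"
)
truth = {
    "White's queen is attacked": True,
    "Qd1 retreats the queen to safety": False,
}
print(process_reward(trace, truth))  # 0.5: one of two claims verified
```

Because verification is purely syntactic extraction plus a ground-truth check, the reward targets the reasoning process itself rather than only the final answer.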
Key facts
- VPS jointly optimizes prediction accuracy and reasoning quality.
- Standard RL with verifiable rewards can degrade reasoning while improving task accuracy.
- Supervised fine-tuning induces a structured reasoning format.
- Process-level rewards are derived from intermediate claims evaluated against ground-truth signals.
- Adaptive reward weighting prioritizes the components with the largest remaining errors (see the sketch after this list).
- Evaluation is performed on chess.
- The paper is available on arXiv.
- arXiv ID: 2605.12519
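A minimal sketch of what adaptive reward weighting could look like. The summary does not give the actual weighting rule, so the exponential-moving-average error tracking, the `momentum` parameter, and the normalization below are assumptions chosen to illustrate how upweighting high-error components yields an implicit curriculum.

```python
# Hypothetical adaptive reward weighting: weights shift toward the reward
# components whose running error remains largest. All details are assumed.
import numpy as np


class AdaptiveWeights:
    """Upweight reward components with the largest remaining error."""

    def __init__(self, n_components: int, momentum: float = 0.9):
        self.err = np.ones(n_components)  # start with uniform error estimates
        self.momentum = momentum

    def update(self, rewards: np.ndarray) -> np.ndarray:
        # Treat 1 - reward as the residual error of each component and
        # track it with an exponential moving average.
        self.err = self.momentum * self.err + (1 - self.momentum) * (1 - rewards)
        # Normalize errors into weights: components that are still far from
        # solved dominate the training signal, giving an implicit curriculum
        # that moves on as each component improves.
        return self.err / self.err.sum()


weights = AdaptiveWeights(n_components=3)
batch_rewards = np.array([0.9, 0.4, 0.7])  # per-component mean rewards
print(weights.update(batch_rewards))       # largest weight on component 1
```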
Entities
Institutions
- arXiv