ReCode: Enhancing Code Generation via Reasoning-Process Rewards
Researchers propose ReCode, a reinforcement learning framework that improves code generation by directly optimizing the quality of the model's reasoning process. It addresses two challenges: the scarcity of fine-grained preference data for training reward models, and the risk of reward hacking. ReCode combines Contrastive Reasoning-Process Reward Learning (CRPL), which trains a reward model on synthesized reasoning variants, with Consistency-Gated GRPO (CG-GRPO), which integrates the learned reasoning-process rewards with execution outcomes. The work is detailed in arXiv paper 2508.05170.
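The contrastive idea behind CRPL can be illustrated with a standard Bradley-Terry-style pairwise loss: the reward model is pushed to score an optimized reasoning variant above its degraded counterpart. This is a minimal sketch under that assumption; the function name and the use of raw scalar scores are illustrative, not the paper's actual implementation.

```python
import math

def crpl_pairwise_loss(score_optimized: float, score_degraded: float) -> float:
    """Illustrative contrastive loss for a reasoning-process reward model.

    loss = -log(sigmoid(r_opt - r_deg)): near zero when the optimized
    variant scores well above the degraded one, large when inverted.
    """
    margin = score_optimized - score_degraded
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# A correctly ordered pair incurs low loss; an inverted pair, high loss.
low = crpl_pairwise_loss(2.0, -1.0)   # margin +3
high = crpl_pairwise_loss(-1.0, 2.0)  # margin -3
```

Training on many such synthesized pairs sidesteps the need for human-labeled fine-grained preference data.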
Key facts
- ReCode stands for Reasoning-Reinforced Code Generation.
- It uses Contrastive Reasoning-Process Reward Learning (CRPL).
- CRPL trains a reward model with synthesized optimized and degraded reasoning variants.
- Consistency-Gated GRPO (CG-GRPO) gates neural reasoning-process rewards with execution outcomes.
- The framework aims to improve code generation by optimizing reasoning quality.
- It addresses scarcity of fine-grained preference data for reward model training.
- It mitigates reward hacking by integrating execution outcomes.
- The paper is available on arXiv with ID 2508.05170.
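The gating mechanism in CG-GRPO can be sketched as follows: the neural reasoning-process reward contributes only when it is consistent with the execution outcome, which blocks reward hacking on process scores that contradict test results, and the gated rewards then feed GRPO's group-normalized advantage. The gate condition, threshold, and reward scales below are illustrative assumptions, not the paper's exact formulation.

```python
def gated_reward(exec_passed: bool, process_reward: float,
                 threshold: float = 0.5) -> float:
    """Consistency gate (illustrative): keep the process reward only
    when its judgment agrees with the execution outcome."""
    exec_reward = 1.0 if exec_passed else -1.0
    consistent = (process_reward >= threshold) == exec_passed
    return exec_reward + (process_reward if consistent else 0.0)

def grpo_advantages(rewards: list[float]) -> list[float]:
    """GRPO-style advantage: normalize each reward against the
    mean and std of its sampled group (no learned critic)."""
    mean = sum(rewards) / len(rewards)
    std = (sum((r - mean) ** 2 for r in rewards) / len(rewards)) ** 0.5
    return [(r - mean) / (std + 1e-8) for r in rewards]

# A group of sampled completions: pass with agreeing process reward,
# pass with contradicting process reward (gated out), and a failure.
group = [gated_reward(True, 0.8), gated_reward(True, 0.2),
         gated_reward(False, 0.9)]
advantages = grpo_advantages(group)
```

Gating the process reward on execution consistency keeps the verifiable signal (tests) authoritative while still letting the learned reward shape intermediate reasoning.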