Signal Reshaping for GRPO in Weak-Feedback Agentic Code Repair
A new arXiv preprint (2605.07276) proposes signal reshaping for Group Relative Policy Optimization (GRPO) in code-agent reinforcement learning, specifically for agentic compile-fix tasks. The authors argue that GRPO's within-group comparison is meaningful only after reshaping three types of signals: outcome rewards for semantic ranking, process signals for intra-trajectory credit assignment, and rollouts for execution comparability. They introduce a minimal construction with compile-and-semantic layered rewards, step-level process scores kept outside group reward normalization, and failure-cause-aware rollout governance, leaving GRPO's group-normalized advantage construction unchanged. The work targets the weak-feedback setting, in which rollout-time signals are reliable but capture only necessary or surface conditions of correctness.
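For context, standard GRPO computes each rollout's advantage by normalizing its scalar reward against the statistics of its group; the preprint keeps this construction untouched and reshapes only the signals fed into it. A minimal sketch of that baseline (function and variable names are illustrative, not from the paper):

```python
from statistics import mean, pstdev

def grpo_advantages(rewards, eps=1e-6):
    """Standard GRPO group-normalized advantages: (r_i - mean) / (std + eps).

    Each rollout in a group is scored relative to its own group's
    statistics; this is the construction the paper leaves unchanged.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# Example: a group of 4 rollouts with binary outcome rewards.
advs = grpo_advantages([1.0, 0.0, 0.0, 1.0])
```

Because advantages are centered within the group, they sum to (approximately) zero, so only relative ordering of rollouts in a group carries training signal.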
Key facts
- arXiv:2605.07276v1
- Announce type: new
- Abstract discusses code-agent RL with weak feedback
- Setting: agentic compile-fix
- Signal reshaping for standard GRPO
- Three signal types: outcome rewards, process signals, rollouts
- Compile-and-semantic layered rewards
- Step-level process scores outside group reward normalization
- Failure-cause-aware rollout governance
- GRPO's group-normalized advantage construction unchanged
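The construction in the bullets above can be combined into a toy scoring sketch. The abstract does not specify the exact layering or the process-score mechanics, so everything below is an illustrative assumption: a compile gate before semantic reward, and a per-step process bonus added after, not inside, the group normalization (the `0.5` split and the `beta` coefficient are made-up knobs):

```python
from statistics import mean, pstdev

def layered_outcome_reward(compiles: bool, semantic_score: float) -> float:
    # Assumed layering: failing to compile zeroes the reward; compiling
    # earns a base reward plus a semantic component (e.g. test pass rate).
    if not compiles:
        return 0.0
    return 0.5 + 0.5 * semantic_score

def shaped_advantages(outcomes, process_scores, beta=0.1, eps=1e-6):
    # Group-normalize only the outcome rewards (standard GRPO), then add
    # step-level process scores outside the normalization, as the abstract
    # describes; beta weighting is an assumption for this sketch.
    mu, sigma = mean(outcomes), pstdev(outcomes)
    normed = [(r - mu) / (sigma + eps) for r in outcomes]
    return [a + beta * p for a, p in zip(normed, process_scores)]

# Three rollouts: compiles & passes, compiles & partly passes, fails to compile.
group = [layered_outcome_reward(c, s)
         for c, s in [(True, 1.0), (True, 0.2), (False, 0.0)]]
advs = shaped_advantages(group, process_scores=[0.8, 0.5, 0.1])
```

Keeping process scores outside the normalization means they nudge credit assignment within a trajectory without distorting the group's semantic ranking, which is the separation the authors argue for.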