Signal Reshaping for GRPO in Weak-Feedback Agentic Code Repair
A new arXiv preprint (2605.07276) proposes signal reshaping for Group Relative Policy Optimization (GRPO) in code-agent reinforcement learning, specifically for agentic compile-fix tasks. The authors argue that GRPO's within-group comparison is meaningful only after reshaping three types of signals: outcome rewards for semantic ranking, process signals for intra-trajectory credit assignment, and rollouts for execution comparability. They introduce a minimal construction with compile-and-semantic layered rewards, step-level process scores kept outside group reward normalization, and failure-cause-aware rollout governance, leaving GRPO's group-normalized advantage construction unchanged. The work targets the weak-feedback setting, in which rollout-time signals are reliable but capture only necessary or surface conditions of correctness.
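For context, standard GRPO computes each rollout's advantage by normalizing its scalar reward against the statistics of its group; the preprint keeps this construction untouched and reshapes only the signals fed into it. A minimal sketch of that baseline (function and variable names are illustrative, not from the paper):

```python
from statistics import mean, pstdev

def grpo_advantages(rewards, eps=1e-6):
    """Standard GRPO group-normalized advantages: (r_i - mean) / (std + eps).

    Each rollout in a group is scored relative to its own group's
    statistics; this is the construction the paper leaves unchanged.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# Example: a group of 4 rollouts with binary outcome rewards.
advs = grpo_advantages([1.0, 0.0, 0.0, 1.0])
```

Because advantages are centered within the group, they sum to (approximately) zero, so only relative ordering of rollouts in a group carries training signal.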
Key facts
- arXiv:2605.07276v1
- Announce type: new
- Abstract discusses code-agent RL with weak feedback
- Setting: agentic compile-fix
- Signal reshaping for standard GRPO
- Three signal types: outcome rewards, process signals, rollouts
- Compile-and-semantic layered rewards
- Step-level process scores outside group reward normalization
- Failure-cause-aware rollout governance
- GRPO's group-normalized advantage construction unchanged
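The construction in the bullets above can be combined into a toy scoring sketch. The abstract does not specify the exact layering or the process-score mechanics, so everything below is an illustrative assumption: a compile gate before semantic reward, and a per-step process bonus added after, not inside, the group normalization (the `0.5` split and the `beta` coefficient are made-up knobs):

```python
from statistics import mean, pstdev

def layered_outcome_reward(compiles: bool, semantic_score: float) -> float:
    # Assumed layering: failing to compile zeroes the reward; compiling
    # earns a base reward plus a semantic component (e.g. test pass rate).
    if not compiles:
        return 0.0
    return 0.5 + 0.5 * semantic_score

def shaped_advantages(outcomes, process_scores, beta=0.1, eps=1e-6):
    # Group-normalize only the outcome rewards (standard GRPO), then add
    # step-level process scores outside the normalization, as the abstract
    # describes; beta weighting is an assumption for this sketch.
    mu, sigma = mean(outcomes), pstdev(outcomes)
    normed = [(r - mu) / (sigma + eps) for r in outcomes]
    return [a + beta * p for a, p in zip(normed, process_scores)]

# Three rollouts: compiles & passes, compiles & partly passes, fails to compile.
group = [layered_outcome_reward(c, s)
         for c, s in [(True, 1.0), (True, 0.2), (False, 0.0)]]
advs = shaped_advantages(group, process_scores=[0.8, 0.5, 0.1])
```

Keeping process scores outside the normalization means they nudge credit assignment within a trajectory without distorting the group's semantic ranking, which is the separation the authors argue for.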