ARTFEED — Contemporary Art Intelligence

Signal Reshaping for GRPO in Weak-Feedback Agentic Code Repair

other · 2026-05-11

A new arXiv preprint (arXiv:2605.07276) proposes signal reshaping for Group Relative Policy Optimization (GRPO) in reinforcement learning for code agents, focusing on agentic compile-fix tasks. The authors argue that GRPO's within-group comparison is meaningful only after three kinds of signal are reshaped: outcome rewards, so that they rank rollouts semantically; process signals, so that credit is assigned within a trajectory; and the rollouts themselves, so that executions are comparable. They introduce a minimal construction with three parts: layered compile-and-semantic rewards, step-level process scores kept outside group reward normalization, and failure-cause-aware rollout governance. GRPO's group-normalized advantage construction is left unchanged. The work targets weak-feedback settings, in which rollout-time signals are reliable but capture only necessary or surface conditions.
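To make the interplay of these pieces concrete, here is a minimal Python sketch of how a layered outcome reward, the usual group-normalized advantage, and a step-level process term added after normalization might fit together. It is not the preprint's implementation: the function names, the 0.5/0.5 reward split, the `beta` weight, and the use of population standard deviation are all assumptions made for illustration.

```python
from statistics import mean, pstdev


def layered_reward(compiles: bool, semantic_score: float) -> float:
    """Hypothetical two-layer outcome reward.

    Layer 1: a compile gate (a rollout that does not compile gets 0).
    Layer 2: a semantic score in [0, 1] that ranks compiling rollouts.
    The 0.5 / 0.5 split is an illustrative assumption.
    """
    if not compiles:
        return 0.0
    return 0.5 + 0.5 * semantic_score


def group_normalized_advantages(rewards, eps=1e-8):
    """Group-normalized advantage: (r - group mean) / (group std + eps)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption of this sketch
    return [(r - mu) / (sigma + eps) for r in rewards]


def step_advantages(outcome_advantage, process_scores, beta=0.1):
    """Add step-level process scores AFTER group normalization.

    The process scores never enter the group mean or std, so the
    normalization itself is untouched. `beta` is a hypothetical weight.
    """
    return [outcome_advantage + beta * s for s in process_scores]


# One group of four rollouts for the same compile-fix task:
# (compiles?, semantic score)
group = [(False, 0.0), (True, 0.2), (True, 0.9), (True, 0.9)]
rewards = [layered_reward(c, s) for c, s in group]
advantages = group_normalized_advantages(rewards)

# Per-step signal for the second rollout, given made-up process scores.
per_step = step_advantages(advantages[1], [0.0, 1.0, -1.0])
```

In this sketch the non-compiling rollout always receives the lowest advantage in its group, while the semantic layer separates rollouts that all compile. Keeping the process term out of `group_normalized_advantages` mirrors the summary's statement that step-level scores sit outside group reward normalization.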

Key facts

  • arXiv:2605.07276v1
  • Announce type: new
  • Abstract discusses code-agent RL with weak feedback
  • Setting: agentic compile-fix
  • Signal reshaping for standard GRPO
  • Three signal types: outcome rewards, process signals, rollouts
  • Compile-and-semantic layered rewards
  • Step-level process scores outside group reward normalization
  • Failure-cause-aware rollout governance
  • GRPO's group-normalized advantage construction unchanged
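
This summary does not spell out what failure-cause-aware rollout governance involves. One possible reading, sketched below in Python, is that each rollout is tagged with why it failed, and rollouts that failed for reasons unrelated to the policy (a timeout or sandbox crash, say) are dropped before within-group comparison. The `FailureCause` categories, the `Rollout` type, and the `min_size` threshold are hypothetical and are not taken from the preprint.

```python
from dataclasses import dataclass
from enum import Enum


class FailureCause(Enum):
    """Hypothetical failure categories for a compile-fix rollout."""
    NONE = "none"      # rollout ran to completion
    POLICY = "policy"  # the agent's own edits failed (e.g. still does not compile)
    INFRA = "infra"    # timeout, sandbox crash, or another environment fault


@dataclass
class Rollout:
    rollout_id: int
    reward: float
    cause: FailureCause


def comparable_group(rollouts, min_size=2):
    """Keep only rollouts whose outcome reflects the policy's behavior.

    Infrastructure failures are dropped so they do not distort the
    group statistics. If fewer than `min_size` rollouts remain, the
    group is skipped by returning an empty list (an assumption).
    """
    kept = [r for r in rollouts if r.cause is not FailureCause.INFRA]
    return kept if len(kept) >= min_size else []


rollouts = [
    Rollout(0, 0.95, FailureCause.NONE),
    Rollout(1, 0.0, FailureCause.POLICY),
    Rollout(2, 0.0, FailureCause.INFRA),
]
kept_ids = [r.rollout_id for r in comparable_group(rollouts)]
```

Here the policy-caused failure stays in the group because its zero reward is informative, while the infrastructure failure is excluded because its zero reward says nothing about the quality of the fix.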

Entities

Institutions

  • arXiv
