ARTFEED — Contemporary Art Intelligence

Researchers Develop Rubric-Based Generative Reward Model for Fine-Tuning AI Software Engineering Agents

ai-technology · 2026-04-22

A new research paper introduces a rubric-based Generative Reward Model (GRM) designed to improve the training of Large Language Model (LLM) agents for Software Engineering (SWE) tasks. Current end-to-end fine-tuning methods primarily rely on verifiable terminal rewards, such as whether unit tests pass, which offer binary signals about final solution correctness but provide minimal guidance for shaping intermediate behaviors during multi-step interactions. This limitation restricts enhancements in the overall quality of the resolution process. The proposed GRM incorporates human-designed rubrics that specify criteria for encouraging or discouraging particular behavioral patterns, delivering richer learning signals. This feedback is leveraged for high-quality training data collection through trajectory filtration. When applied to Reinforced Fine-Tuning (RFT) on SWE tasks, the approach outperforms methods using only terminal-score-based rejection sampling by more effectively suppressing undesirable behaviors and improving the training process. The research addresses a gap in current AI agent training by focusing on intermediate behavioral guidance rather than solely on final outcomes. The paper is available on arXiv under the identifier arXiv:2604.16335v1 and is categorized as a cross-announcement type. This work contributes to advancing AI capabilities in software engineering by refining how agents learn and interact during complex problem-solving tasks.

Key facts

  • A rubric-based Generative Reward Model (GRM) was developed for fine-tuning LLM agents in Software Engineering tasks.
  • Current fine-tuning methods rely on verifiable terminal rewards like unit test passes, offering limited guidance for intermediate behaviors.
  • The GRM uses human-designed rubrics to encourage or discourage specific behavioral patterns.
  • Feedback from the GRM is used for high-quality training data collection via trajectory filtration.
  • The approach outperforms terminal-score-only rejection sampling in Reinforced Fine-Tuning on SWE tasks.
  • The research aims to improve the overall quality of the resolution process by shaping intermediate behaviors.
  • The paper is published on arXiv with the identifier arXiv:2604.16335v1.
  • The announcement type is cross, indicating it spans multiple categories or fields.

Entities

Institutions

  • arXiv

Sources