Rubric-Grounded RL Boosts LLM Reasoning with Structured Rewards
A novel approach, rubric-grounded reinforcement learning (RL), decomposes the reward into weighted, verifiable criteria scored by a frozen LLM judge, yielding a signal that supports partial-credit optimization. Rather than relying on a binary outcome or a single overall score, each response is evaluated against multiple task-specific criteria, and the judge conditions on auxiliary grounding text that the policy never sees. The policy is then optimized against this structured, multi-criterion reward. The framework was developed using rubrics derived from an OSTI corpus of approximately 100,000 scientific and technical documents, training Llama-3.1-8B-Instruct with Group Relative Policy Optimization (GRPO). The GRPO-optimized model reached a normalized reward of 71.7% on the held-out rubric evaluation and outperformed the base model on four reasoning benchmarks.
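The reward aggregation can be pictured as a weighted average of per-criterion judge scores. The sketch below is a minimal illustration, not the paper's implementation: the `Criterion` type, the `judge_score` callable, and the example rubric items are assumptions, and the frozen judge's prompt (which would carry the grounding text withheld from the policy) is abstracted behind the callable.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Criterion:
    """One verifiable rubric item with a relative weight (hypothetical type)."""
    description: str
    weight: float

def rubric_reward(
    response: str,
    criteria: list[Criterion],
    judge_score: Callable[[str, Criterion], float],
) -> float:
    """Weighted average of per-criterion scores from the frozen judge.

    judge_score is assumed to return a value in [0, 1] per criterion; in the
    paper's setup the judge also sees grounding text the policy does not.
    """
    total_weight = sum(c.weight for c in criteria)
    weighted_sum = sum(c.weight * judge_score(response, c) for c in criteria)
    return weighted_sum / total_weight  # normalized reward in [0, 1]

# Illustrative rubric; the real rubrics are derived from OSTI documents.
criteria = [
    Criterion("States the correct final result", weight=2.0),
    Criterion("Justifies each step from the source text", weight=1.0),
    Criterion("Uses correct units and notation", weight=1.0),
]
reward = rubric_reward("candidate response", criteria, judge_score=lambda r, c: 0.5)  # dummy judge
```

Because each criterion contributes its own fraction of the total weight, a response that satisfies some but not all criteria still earns partial credit, which is what distinguishes this signal from a binary pass/fail reward.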
Key facts
- Rubric-grounded RL decomposes reward into weighted, verifiable criteria.
- A frozen LLM judge scores responses along multiple task-specific criteria.
- The policy is optimized against a structured, multi-criterion reward.
- Rubrics derived from an OSTI corpus of roughly 100,000 documents.
- Llama-3.1-8B-Instruct trained with GRPO (see the advantage sketch after this list).
- Model achieved 71.7% normalized reward on held-out rubric evaluation.
- GRPO-tuned policy improved over base model on four reasoning benchmarks.
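GRPO itself needs no value network: for each prompt it samples a group of completions, scores each one (here, with the rubric reward), and normalizes rewards within the group to obtain advantages. A minimal sketch of that group-relative step, with illustrative names and shapes rather than the paper's exact variant, might look like:

```python
import torch

def grpo_advantages(group_rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """Group-relative advantages: normalize each completion's rubric reward
    against the other completions sampled for the same prompt.

    group_rewards: shape (num_prompts, group_size), rubric rewards in [0, 1].
    """
    mean = group_rewards.mean(dim=1, keepdim=True)
    std = group_rewards.std(dim=1, keepdim=True)
    return (group_rewards - mean) / (std + eps)

# Example: 2 prompts, 4 sampled completions each.
rewards = torch.tensor([[0.9, 0.4, 0.6, 0.7],
                        [0.2, 0.2, 0.8, 0.5]])
print(grpo_advantages(rewards))
```

Completions that beat their group's average get positive advantages and are reinforced under a clipped PPO-style objective; those below average are suppressed.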
Entities
Institutions
- Office of Scientific and Technical Information (OSTI)