POW3R: Policy-Aware Rubric Rewards for RLVR
A new arXiv preprint (2605.20164) introduces POW3R, a policy-aware rubric reward framework for reinforcement learning with verifiable rewards (RLVR). The authors argue that static rubric aggregation conflates a criterion's human-assigned importance with its usefulness as an optimization signal: a criterion that is already saturated or currently unreachable for the policy does little to distinguish one rollout from another, whatever its weight. POW3R keeps the human weights and category balance intact while adapting criterion-level reward weights during training, using contrast between rollouts as the signal. This targets the observation that the criteria that distinguish rollouts are not necessarily the ones with the largest human weights.
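The idea can be illustrated with a small sketch. This summary does not give POW3R's actual update rule, so everything below is an assumption made for illustration: contrast is measured as the variance of a criterion's scores across rollouts, a hypothetical `floor` parameter keeps zero-contrast criteria from losing all weight, and weights are renormalized within each category so that every category keeps its total human weight. The function names are likewise invented here and are not taken from the paper.

```python
from statistics import pvariance


def static_rewards(scores, weights):
    """Weighted sum of criterion scores for each rollout, using fixed weights."""
    return [sum(w * s for w, s in zip(weights, row)) for row in scores]


def contrast_adjusted_weights(scores, human_weights, categories, floor=0.05):
    """Illustrative reweighting (an assumption, not the paper's rule).

    Shifts weight toward criteria whose scores differ across rollouts while
    keeping the total human weight of each category unchanged.
    """
    n_criteria = len(human_weights)
    # Contrast of a criterion: population variance of its scores over rollouts.
    contrast = [pvariance([row[j] for row in scores]) for j in range(n_criteria)]
    # Scale the human weight by contrast; the floor keeps saturated or
    # unreachable criteria from dropping to exactly zero.
    raw = [human_weights[j] * (floor + contrast[j]) for j in range(n_criteria)]

    adjusted = [0.0] * n_criteria
    for cat in sorted(set(categories)):
        idx = [j for j in range(n_criteria) if categories[j] == cat]
        cat_mass = sum(human_weights[j] for j in idx)  # preserved per category
        raw_mass = sum(raw[j] for j in idx)
        for j in idx:
            adjusted[j] = cat_mass * raw[j] / raw_mass
    return adjusted


# Four rollouts scored on four binary criteria (made-up data).
# Criterion 0 is saturated (always met), criterion 2 is unreachable (never met).
scores = [
    [1.0, 1.0, 0.0, 1.0],
    [1.0, 0.0, 0.0, 1.0],
    [1.0, 1.0, 0.0, 0.0],
    [1.0, 0.0, 0.0, 0.0],
]
human_weights = [0.4, 0.1, 0.3, 0.2]
categories = ["accuracy", "accuracy", "style", "style"]

adjusted = contrast_adjusted_weights(scores, human_weights, categories)
print("adjusted weights:", adjusted)
print("static rewards:  ", static_rewards(scores, human_weights))
print("adaptive rewards:", static_rewards(scores, adjusted))
```

In this toy data the two criteria with the largest human weights carry no contrast, so the static rewards span only 0.4 to 0.7. After reweighting, the adjusted weights come out to roughly 0.2, 0.3, 0.1 and 0.4, each category still sums to 0.5, and the rewards span 0.2 to 0.9, which gives an optimizer more separation between rollouts.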
Key facts
- arXiv paper 2605.20164 introduces POW3R
- POW3R is a policy-aware rubric reward framework
- It addresses static rubric aggregation issues in RLVR
- Standard aggregations conflate human importance with optimization signal
- POW3R preserves human weights and category balance
- It adapts criterion-level reward weights during training
- It uses rollout-level contrast to adapt those weights
- Published on arXiv (the identifier prefix 2605 indicates a May 2026 submission)
Entities
Institutions
- arXiv