POW3R: Policy-Aware Rubric Rewards for RLVR

ai-technology · 2026-05-20

A new arXiv preprint (2605.20164) introduces POW3R, a policy-aware rubric reward framework for reinforcement learning with verifiable rewards (RLVR). The authors argue that standard static rubric aggregations conflate human-assigned importance with optimization usefulness: a criterion can carry a large human weight and still be saturated or unreachable for the current policy, in which case it does little to separate one rollout from another. POW3R preserves human weights and category balance while adapting criterion-level reward weights during training using rollout-level contrast. This addresses the limitation that the criteria that distinguish rollouts are not necessarily those with the largest human weights.
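
The idea of reweighting criteria by rollout-level contrast while keeping category balance can be illustrated with a small sketch. This is not the paper's algorithm: the summary does not give POW3R's formulas, so the use of score variance as the "contrast" measure, the per-category renormalization, and the function names adapt_weights and rollout_rewards are all assumptions made for illustration.

```python
from statistics import pvariance


def adapt_weights(human_weights, categories, scores, eps=1e-8):
    """Illustrative only (assumed scheme, not taken from the POW3R paper).

    human_weights: one human-assigned weight per criterion
    categories:    one category label per criterion
    scores:        rollouts x criteria matrix of rubric scores in [0, 1]
    """
    n = len(human_weights)
    # Assumed contrast measure: variance of each criterion across rollouts.
    # A saturated or unreachable criterion scores the same everywhere, so 0.
    contrast = [pvariance([row[j] for row in scores]) for j in range(n)]
    raw = [human_weights[j] * contrast[j] for j in range(n)]

    adapted = [0.0] * n
    for cat in set(categories):
        idx = [j for j in range(n) if categories[j] == cat]
        human_total = sum(human_weights[j] for j in idx)
        raw_total = sum(raw[j] for j in idx)
        for j in idx:
            if raw_total > eps:
                # Redistribute inside the category so that the category
                # keeps exactly the total weight the humans gave it.
                adapted[j] = human_total * raw[j] / raw_total
            else:
                # Nothing in this category separates the rollouts:
                # fall back to the human weights unchanged.
                adapted[j] = human_weights[j]
    return adapted


def rollout_rewards(weights, scores):
    """Weighted average of criterion scores for each rollout."""
    total = sum(weights)
    return [sum(w * s for w, s in zip(weights, row)) / total for row in scores]


# Hypothetical rubric: criterion 0 is saturated (every rollout passes it).
human_weights = [3.0, 1.0, 2.0]
categories = ["accuracy", "accuracy", "style"]
scores = [[1, 0, 1], [1, 1, 0], [1, 1, 1], [1, 0, 0]]

adapted = adapt_weights(human_weights, categories, scores)
print(adapted)
print(rollout_rewards(adapted, scores))
```

In this toy rubric the saturated criterion holds the largest human weight but cannot tell the rollouts apart, so its weight shifts to the other criterion in the same category while each category's total stays where the humans set it.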

Key facts

arXiv paper 2605.20164 introduces POW3R
POW3R is a policy-aware rubric reward framework
It addresses static rubric aggregation issues in RLVR
Standard aggregations conflate human importance with optimization signal
POW3R preserves human weights and category balance
It adapts criterion-level reward weights during training
It uses rollout-level contrast for weight adaptation
Published on arXiv in 2026
