ARTFEED — Contemporary Art Intelligence

POW3R: Policy-Aware Rubric Rewards for RLVR

ai-technology · 2026-05-20

A new arXiv preprint (2605.20164) introduces POW3R, a policy-aware rubric reward framework for reinforcement learning with verifiable rewards (RLVR). The authors argue that standard static rubric aggregation conflates human-assigned importance with usefulness for optimization: a criterion may be saturated or unreachable for the current policy, in which case it carries a large weight but does little to separate one rollout from another. POW3R preserves human weights and category balance while adapting criterion-level reward weights during training using rollout-level contrast. This addresses the limitation that the criteria that actually distinguish rollouts are not necessarily the ones with the largest human weights.
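The contrast between static and policy-aware aggregation can be illustrated with a minimal Python sketch. This is not the paper's formulation: the summary above does not give POW3R's update rule, so the sketch assumes that "rollout-level contrast" can be approximated by the variance of a criterion's score across a group of rollouts, and that "category balance" means each category keeps its total human weight. The function names, criteria, categories, and scores are all hypothetical.

```python
from statistics import pvariance


def aggregate(scores, weights):
    """Weighted mean of per-criterion scores (static rubric aggregation)."""
    total = sum(weights.values())
    return sum(weights[c] * scores[c] for c in weights) / total


def contrast_adapted_weights(rollouts, human_w, categories, eps=1e-8):
    """Hypothetical reweighting, NOT the paper's rule.

    Each human weight is scaled by the criterion's score variance across
    rollouts (an assumed proxy for rollout-level contrast), then weights are
    renormalized inside each category so the category keeps its original
    total human weight.
    """
    contrast = {c: pvariance([r[c] for r in rollouts]) for c in human_w}
    adapted = {}
    for crits in categories.values():
        cat_mass = sum(human_w[c] for c in crits)
        raw = {c: human_w[c] * contrast[c] for c in crits}
        norm = sum(raw.values())
        for c in crits:
            if norm > eps:
                adapted[c] = cat_mass * raw[c] / norm
            else:
                # No criterion in this category separates the rollouts:
                # fall back to the human weights unchanged.
                adapted[c] = human_w[c]
    return adapted


# Illustrative data only: "safety" is saturated (every rollout passes),
# "tone" is unreachable (every rollout fails).
human_w = {"safety": 5.0, "accuracy": 3.0, "tone": 1.0, "brevity": 1.0}
categories = {"content": ["safety", "accuracy"], "style": ["tone", "brevity"]}
rollouts = [
    {"safety": 1.0, "accuracy": 1.0, "tone": 0.0, "brevity": 1.0},
    {"safety": 1.0, "accuracy": 0.0, "tone": 0.0, "brevity": 0.0},
    {"safety": 1.0, "accuracy": 1.0, "tone": 0.0, "brevity": 0.0},
    {"safety": 1.0, "accuracy": 0.0, "tone": 0.0, "brevity": 1.0},
]

adapted_w = contrast_adapted_weights(rollouts, human_w, categories)
static_rewards = [aggregate(r, human_w) for r in rollouts]
adapted_rewards = [aggregate(r, adapted_w) for r in rollouts]

print("adapted weights:", adapted_w)
print("static rewards: ", static_rewards)
print("adapted rewards:", adapted_rewards)
```

In this toy setup the saturated and unreachable criteria contribute the same amount to every rollout under the static weights, so the reward spread comes only from the lower-weighted criteria. The adapted weights move each category's mass onto the criteria that vary across rollouts, which widens the spread while leaving the per-category totals unchanged.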

Key facts

  • arXiv paper 2605.20164 introduces POW3R
  • POW3R is a policy-aware rubric reward framework
  • It addresses static rubric aggregation issues in RLVR
  • Standard aggregations conflate human importance with optimization signal
  • POW3R preserves human weights and category balance
  • It adapts criterion-level reward weights during training
  • Uses rollout-level contrast for weight adaptation
  • Published on arXiv in 2026

Entities

Institutions

  • arXiv

Sources