ARTFEED — Contemporary Art Intelligence

Behavior-Consistent Deep Reinforcement Learning: A New Framework

other · 2026-05-22

A new paper on arXiv (2605.21214v2) formalizes behavior-consistent reinforcement learning, a framework for addressing cross-run policy divergence. The authors propose using maximum-entropy RL to anchor training runs to a common uniform prior. They prove that setting the temperature proportional to the disagreement between Q-functions bounds the pairwise KL divergence between the resulting Boltzmann policies. They also caution that naively increasing entropy may impair optimization and amplify off-policy error, and they introduce Q-value Expectile Disagreement as a solution.
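The link between temperature and cross-run divergence can be illustrated with a small numerical sketch. It is not the paper's method or its theorem: it assumes a single state with five actions, two hypothetical Q-vectors (`q_run_a`, `q_run_b`) standing in for two training runs, and the sup-norm gap as the disagreement measure. It checks an elementary bound that holds for any two Boltzmann policies, KL <= 2 * disagreement / temperature, so a temperature of c times the disagreement keeps the KL below 2 / c. The paper's own constants and disagreement measure may differ.

```python
import numpy as np


def log_boltzmann_policy(q_values, temperature):
    """Log-probabilities of the Boltzmann (softmax) policy exp(Q / temperature)."""
    logits = np.asarray(q_values, dtype=float) / temperature
    logits = logits - logits.max()  # shift for numerical stability
    return logits - np.log(np.exp(logits).sum())


def pairwise_kl(q_a, q_b, temperature):
    """KL(pi_a || pi_b) between the Boltzmann policies of two Q-vectors."""
    log_p = log_boltzmann_policy(q_a, temperature)
    log_q = log_boltzmann_policy(q_b, temperature)
    return float(np.sum(np.exp(log_p) * (log_p - log_q)))


# Hypothetical Q-values for one state from two independent runs (assumption).
rng = np.random.default_rng(0)
q_run_a = rng.normal(size=5)
q_run_b = q_run_a + rng.uniform(-0.5, 0.5, size=5)

# Disagreement measured as the largest absolute gap (assumption, not the paper's measure).
disagreement = float(np.max(np.abs(q_run_a - q_run_b)))

for scale in (0.5, 1.0, 2.0, 4.0):
    temperature = scale * disagreement
    kl = pairwise_kl(q_run_a, q_run_b, temperature)
    bound = 2.0 * disagreement / temperature  # equals 2 / scale
    print(f"scale={scale:>4}: KL={kl:.4f}  bound={bound:.4f}")
```

Raising the scale tightens the bound, but it also flattens both policies toward the uniform prior, which is the trade-off behind the authors' warning about naively increasing entropy.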

Key facts

  • Paper arXiv:2605.21214v2
  • Announce type: cross
  • Addresses cross-run policy divergence in RL
  • Formalizes behavior-consistent RL
  • Uses maximum-entropy RL with uniform prior
  • Proves that a temperature proportional to Q-function disagreement bounds pairwise KL divergence between Boltzmann policies
  • Warns that naively increasing entropy may impair optimization and amplify off-policy error
  • Proposes Q-value Expectile Disagreement

Entities

Institutions

  • arXiv
