Behavior-Consistent Deep Reinforcement Learning: A New Framework

other · 2026-05-22

A new paper on arXiv (2605.21214v2) formalizes behavior-consistent reinforcement learning as a way to address cross-run policy divergence. The authors propose using maximum-entropy RL to anchor training runs to a common uniform prior, and they prove that, for Boltzmann policies, a temperature proportional to the Q-function disagreement bounds the pairwise KL divergence. They caution that naively increasing entropy can impair optimization and amplify off-policy error, and they introduce Q-value Expectile Disagreement to address this.
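The link between temperature, Q-function disagreement, and KL divergence can be illustrated with a small numerical sketch. It is not the paper's algorithm and does not implement Q-value Expectile Disagreement. It only checks an elementary bound for two Boltzmann (softmax) policies over the same action set: KL(pi_a || pi_b) <= 2 * max|Q_a - Q_b| / tau, so choosing tau = scale * max|Q_a - Q_b| keeps the KL below 2 / scale. The function names, the Q-values, and the `scale` parameter are assumptions made for illustration.

```python
import numpy as np


def boltzmann_policy(q_values, temperature):
    """Softmax over Q-values at the given temperature (numerically stabilized)."""
    logits = np.asarray(q_values, dtype=float) / temperature
    logits = logits - logits.max()  # shift so the largest logit is 0
    probs = np.exp(logits)
    return probs / probs.sum()


def kl_divergence(p, q):
    """KL(p || q) for discrete distributions with strictly positive entries."""
    return float(np.sum(p * (np.log(p) - np.log(q))))


def disagreement_temperature(q_a, q_b, scale=1.0, floor=1e-8):
    """Hypothetical rule: temperature proportional to the max-norm Q gap."""
    gap = float(np.max(np.abs(np.asarray(q_a) - np.asarray(q_b))))
    return max(scale * gap, floor)  # floor avoids dividing by zero


# Illustrative Q-values for one state from two training runs (made-up numbers).
q_run_a = np.array([1.0, 2.0, 0.5])
q_run_b = np.array([1.3, 1.6, 0.9])

scale = 2.0
tau = disagreement_temperature(q_run_a, q_run_b, scale=scale)
pi_a = boltzmann_policy(q_run_a, tau)
pi_b = boltzmann_policy(q_run_b, tau)

kl = kl_divergence(pi_a, pi_b)
kl_bound = 2.0 / scale  # from KL <= 2 * gap / tau with tau = scale * gap

print(f"tau={tau:.3f}  KL={kl:.4f}  bound={kl_bound:.4f}")
```

A larger `scale` tightens the bound but pushes both policies toward uniform, which is the trade-off the authors flag when they warn against naively increasing entropy.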

Key facts

Paper arXiv:2605.21214v2
Announce type: cross
Addresses cross-run policy divergence in RL
Formalizes behavior-consistent RL
Uses maximum-entropy RL with uniform prior
Proves that a temperature proportional to Q-function disagreement bounds pairwise KL divergence for Boltzmann policies
Warns against naive entropy increase
Proposes Q-value Expectile Disagreement
