Behavior-Consistent Deep Reinforcement Learning: A New Framework
A new arXiv paper (2605.21214v2) formalizes behavior-consistent reinforcement learning to address cross-run policy divergence, meaning divergence between the policies learned in different training runs. The authors propose using maximum-entropy RL to anchor training runs to a common uniform prior, and prove that, for Boltzmann policies, setting the temperature proportional to Q-function disagreement bounds the pairwise KL divergence. They caution that naively increasing entropy may impair optimization and amplify off-policy error, and introduce Q-value Expectile Disagreement to address this.
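The link between temperature and policy agreement can be illustrated with an elementary bound. This is a sketch under stated assumptions, not the paper's theorem or method. For two Boltzmann policies pi_i(a) proportional to exp(Q_i(a) / tau), the log-probabilities differ by at most 2 * delta / tau, where delta is the largest absolute gap between the two Q-vectors, so KL(pi_1 || pi_2) <= 2 * delta / tau. Choosing tau = c * delta therefore caps the KL divergence at 2 / c regardless of how large the disagreement is. The Q-values, the sup-norm disagreement measure, and the constants `c` below are hypothetical choices made for illustration.

```python
import numpy as np

def log_boltzmann(q_values, temperature):
    """Log-probabilities of the Boltzmann policy pi(a) ~ exp(Q(a) / temperature)."""
    logits = np.asarray(q_values, dtype=float) / temperature
    logits = logits - logits.max()  # shift for numerical stability
    return logits - np.log(np.exp(logits).sum())

def kl_divergence(log_p, log_q):
    """KL(p || q) for two discrete distributions given as log-probabilities."""
    return float(np.sum(np.exp(log_p) * (log_p - log_q)))

# Hypothetical Q-value estimates for a single state from two training runs.
rng = np.random.default_rng(0)
q_run_a = rng.normal(size=6)
q_run_b = q_run_a + rng.normal(scale=0.3, size=6)

# Assumed disagreement measure: largest absolute gap between the two estimates.
disagreement = float(np.max(np.abs(q_run_a - q_run_b)))

for c in (0.5, 1.0, 2.0, 4.0):  # hypothetical proportionality constants
    temperature = c * disagreement
    kl = kl_divergence(log_boltzmann(q_run_a, temperature),
                       log_boltzmann(q_run_b, temperature))
    # The elementary bound described above: KL <= 2 * disagreement / temperature = 2 / c.
    print(f"c={c:<4} temperature={temperature:.3f} KL={kl:.4f} bound={2.0 / c:.4f}")
```

A larger `c` tightens the bound but also flattens both policies toward uniform, which matches the summary's caution that simply raising entropy can come at the cost of optimization.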
Key facts
- Paper arXiv:2605.21214v2
- arXiv announcement type: cross (cross-listing)
- Addresses cross-run policy divergence in RL
- Formalizes behavior-consistent RL
- Uses maximum-entropy RL with uniform prior
- Proves that a temperature proportional to Q-function disagreement bounds pairwise KL divergence for Boltzmann policies
- Warns against naive entropy increase
- Proposes Q-value Expectile Disagreement
Entities
Institutions
- arXiv