New Harmonic Mean Operator for Average Reward RL in SMDPs
A new research paper introduces a modified harmonic mean operator for average reward reinforcement learning in semi-Markov decision processes (SMDPs). According to the paper, the operator correctly computes reward rates even when rewards and durations are non-stationary over an infinite horizon, a setting in which existing ratio-based algorithms can be incorrect. The authors prove theoretical properties of the operator and demonstrate it empirically. The work targets continuing, non-episodic tasks and yields model-free learning algorithms that are robust to changing distributions.
Key facts
- arXiv:2605.04880v1
- arXiv announce type: cross (cross-listed)
- Focus on undiscounted average reward RL in infinite-horizon, non-episodic tasks
- In an SMDP, each discrete action generates a stochastic reward and a stochastic duration
- The objective is to optimize the average reward rate (reward per unit time)
- Existing ratio-based algorithms can be incorrect under non-stationary conditions
- Paper presents a novel modified harmonic mean operator
- Operator correctly computes reward rates under non-stationarity
- Yields model-free learning algorithms for SMDPs
- Theoretical properties are proven
- Empirical demonstration is included
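To make "reward rate" concrete, here is a minimal Python sketch of the standard definition: total reward divided by total elapsed time. It also shows why a harmonic mean arises naturally in this setting: the rate equals a reward-weighted harmonic mean of the per-transition rates, while a plain arithmetic mean of those rates gives a different number. This is not the paper's modified operator, whose definition is not given in this summary. The function names and the two-transition data are hypothetical, and the harmonic identity as written assumes strictly positive rewards and durations.

```python
from fractions import Fraction


def reward_rate(rewards, durations):
    """Average reward rate: total reward divided by total elapsed time."""
    return sum(rewards) / sum(durations)


def mean_of_ratios(rewards, durations):
    """Unweighted arithmetic mean of per-transition rates r_i / d_i."""
    return sum(r / d for r, d in zip(rewards, durations)) / len(rewards)


def weighted_harmonic_mean(values, weights):
    """Weighted harmonic mean: sum(w_i) / sum(w_i / x_i). Requires x_i != 0."""
    return sum(weights) / sum(w / x for w, x in zip(weights, values))


# Hypothetical data: two transitions with (reward, duration) = (1, 1) and (4, 8).
# Fractions keep the arithmetic exact.
rewards = [Fraction(1), Fraction(4)]
durations = [Fraction(1), Fraction(8)]
rates = [r / d for r, d in zip(rewards, durations)]  # per-transition rates: 1 and 1/2

print(reward_rate(rewards, durations))         # 5/9
print(mean_of_ratios(rewards, durations))      # 3/4
print(weighted_harmonic_mean(rates, rewards))  # 5/9
```

The long transition dominates elapsed time, so the true rate (5/9) sits closer to its per-transition rate of 1/2 than the unweighted mean (3/4) suggests.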