ARTFEED — Contemporary Art Intelligence

New Harmonic Mean Operator for Average Reward RL in SMDPs

other · 2026-05-07

A new research paper introduces a modified harmonic mean operator for average-reward reinforcement learning in semi-Markov decision processes (SMDPs). According to the authors, existing ratio-based algorithms can compute the reward rate incorrectly when the distributions of rewards and durations are non-stationary over an infinite horizon; the new operator is designed to compute it correctly in that setting. The paper proves theoretical properties of the operator and reports empirical results. The work targets continuing, non-episodic tasks and yields model-free learning algorithms intended to stay robust as those distributions change.
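In an SMDP the long-run reward rate is the limit of total reward divided by total elapsed time: a ratio of sums, not an average of per-transition rates. The short sketch below, using hypothetical function names and toy numbers, checks this on two transitions and shows one standard identity that links the rate to a harmonic mean: if each per-transition rate r_i/τ_i is weighted by its reward r_i (assumed positive), the weighted harmonic mean equals Σr_i/Στ_i. It only illustrates why a harmonic mean arises naturally for rates; it is not the paper's modified operator.

```python
from typing import Sequence


def ratio_of_sums(rewards: Sequence[float], durations: Sequence[float]) -> float:
    """Empirical reward rate: total reward divided by total elapsed time."""
    return sum(rewards) / sum(durations)


def mean_of_ratios(rewards: Sequence[float], durations: Sequence[float]) -> float:
    """Naive arithmetic mean of per-transition rates r_i / tau_i.

    This is generally NOT the reward rate, because it ignores how long
    each transition lasted.
    """
    return sum(r / t for r, t in zip(rewards, durations)) / len(rewards)


def weighted_harmonic_mean(values: Sequence[float], weights: Sequence[float]) -> float:
    """Weighted harmonic mean: sum(w) / sum(w / x). Requires nonzero values."""
    return sum(weights) / sum(w / x for w, x in zip(weights, values))


# Toy data (hypothetical): two transitions with rates 2.0 and 3.0.
rewards = [2.0, 9.0]
durations = [1.0, 3.0]
rates = [r / t for r, t in zip(rewards, durations)]

print(ratio_of_sums(rewards, durations))       # true empirical rate: 11 / 4
print(mean_of_ratios(rewards, durations))      # naive average: (2 + 3) / 2
# Weighting each rate by its (positive) reward recovers the ratio of sums.
print(weighted_harmonic_mean(rates, rewards))
```

The gap between 2.75 and 2.5 here comes purely from the second transition lasting three times as long as the first.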

Key facts

  • arXiv:2605.04880v1
  • Announce Type: cross
  • Focus on undiscounted average-reward RL in infinite-horizon, non-episodic tasks
  • In SMDPs, each discrete action generates a stochastic reward and a stochastic duration
  • Objective is to maximize the average reward rate (reward per unit time)
  • Existing ratio-based algorithms can be incorrect under non-stationary conditions
  • Paper presents a novel modified harmonic mean operator
  • Operator correctly computes reward rates under non-stationarity
  • Yields model-free learning algorithms for SMDPs
  • Theoretical properties are proven
  • Empirical demonstration is included
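To make the "ratio-based" idea concrete, here is a minimal tabular sketch of model-free SMDP Q-learning in which the reward-rate estimate rho is the ratio of cumulative reward to cumulative time, and each transition is charged rho for every unit of elapsed time. It is a generic illustration under stated assumptions, not the paper's algorithm or its harmonic mean operator; the class name, hyperparameters, and toy environment are all hypothetical. Because this rho pools every transition since the start, it keeps weighting old data equally after the reward or duration distributions shift, which is the kind of non-stationary setting the paper addresses.

```python
import random
from collections import defaultdict


class RatioRateQLearner:
    """Hypothetical tabular SMDP Q-learner (illustrative baseline only).

    rho is a ratio-based reward-rate estimate: cumulative reward divided by
    cumulative elapsed time over all transitions seen so far.
    """

    def __init__(self, actions, alpha=0.1, epsilon=0.1, seed=0):
        self.actions = list(actions)
        self.alpha = alpha            # step size for Q updates
        self.epsilon = epsilon        # exploration probability
        self.q = defaultdict(float)   # (state, action) -> relative value
        self.total_reward = 0.0
        self.total_time = 0.0
        self.rng = random.Random(seed)

    @property
    def rho(self):
        # Ratio of sums; defined as 0.0 before any time has elapsed.
        return self.total_reward / self.total_time if self.total_time > 0 else 0.0

    def greedy(self, state):
        # max() returns the first maximal item, so ties go to the earliest action.
        return max(self.actions, key=lambda a: self.q[(state, a)])

    def act(self, state):
        # Epsilon-greedy action selection.
        if self.rng.random() < self.epsilon:
            return self.rng.choice(self.actions)
        return self.greedy(state)

    def update(self, s, a, reward, duration, s_next):
        # SMDP temporal-difference target: subtract rho per unit of elapsed time.
        best_next = max(self.q[(s_next, b)] for b in self.actions)
        target = reward - self.rho * duration + best_next
        self.q[(s, a)] += self.alpha * (target - self.q[(s, a)])
        # Then fold this transition into the totals behind rho.
        self.total_reward += reward
        self.total_time += duration


def step(action):
    # Toy single-state SMDP (hypothetical): action 0 pays 2 over 1 time unit
    # (rate 2); action 1 pays 9 over 3 time units (rate 3).
    return (2.0, 1.0) if action == 0 else (9.0, 3.0)


learner = RatioRateQLearner(actions=[0, 1], seed=0)
state = "s"
for _ in range(5000):
    a = learner.act(state)
    r, tau = step(a)
    learner.update(state, a, r, tau, state)

print(round(learner.rho, 3), learner.greedy(state))
```

Charging rho × duration rather than a flat rho per decision is what separates the SMDP update from its MDP counterpart: a longer action has to earn proportionally more reward to break even.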
