
Differential TD Methods Extended to Episodic Reinforcement Learning

other · 2026-05-07

A new research paper proposes an extension of differential temporal difference (TD) methods to episodic reinforcement learning problems. Differential TD methods, which center each observed reward by subtracting an estimate of the average reward, were previously limited to infinite-horizon settings because reward centering can alter the optimal policy in episodic tasks. The authors prove that their generalized differential TD preserves the ordering of policies under termination, which enables its use in episodic problems. They also show that the method is equivalent to a form of linear TD, so it inherits that method's theoretical guarantees. The work is motivated by recent studies on normalization in streaming deep reinforcement learning. The paper is available on arXiv under identifier 2605.04368.
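
For concreteness, the classic infinite-horizon differential TD(0) update that the paper generalizes can be written as follows; the notation (value estimate \hat{v} with weights w, average-reward estimate \bar{R}, step sizes \alpha and \eta) follows the standard average-reward literature rather than the paper itself:

    \delta_t = R_{t+1} - \bar{R}_t + \hat{v}(S_{t+1}, w_t) - \hat{v}(S_t, w_t)
    \bar{R}_{t+1} = \bar{R}_t + \eta \alpha \delta_t
    w_{t+1} = w_t + \alpha \delta_t \nabla \hat{v}(S_t, w_t)

Subtracting \bar{R}_t from each reward is the centering step. In an episodic task the total subtracted amount scales with episode length, which differs across policies, so naive centering can reorder policies; this is the failure mode the paper's generalization is designed to avoid.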

Key facts

  • Differential TD methods are value-based RL algorithms for infinite-horizon problems (see the sketch after this list).
  • Reward centering keeps returns bounded and removes a state-independent offset.
  • Reward centering can alter the optimal policy in episodic problems.
  • The proposed generalization maintains policy ordering under termination.
  • The method is shown equivalent to a form of linear TD.
  • The work is motivated by normalization in streaming deep RL.
  • The paper is available on arXiv:2605.04368.
  • The research extends differential TD to episodic problems.
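
As a point of reference, here is a minimal tabular sketch of the classic infinite-horizon differential TD(0) algorithm that the paper builds on; the env and policy interfaces are hypothetical, and the paper's episodic generalization itself is not reproduced:

    import numpy as np

    def differential_td0(env, policy, n_states, steps=100_000,
                         alpha=0.1, eta=0.5, seed=0):
        # Tabular differential TD(0) for continuing (infinite-horizon) tasks.
        # v holds differential value estimates; r_bar is the average-reward
        # estimate used for reward centering.
        rng = np.random.default_rng(seed)
        v = np.zeros(n_states)
        r_bar = 0.0
        s = env.reset()                   # hypothetical interface
        for _ in range(steps):
            a = policy(s, rng)            # hypothetical interface
            s_next, r = env.step(a)       # hypothetical interface
            # Centered TD error: subtract the average-reward estimate.
            delta = r - r_bar + v[s_next] - v[s]
            r_bar += eta * alpha * delta  # track the average reward
            v[s] += alpha * delta         # update differential values
            s = s_next
        return v, r_bar

Applying this update unchanged across episode terminations is exactly the case where centering can change which policy is optimal; the paper's contribution is a generalization whose policy ordering is provably preserved under termination.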

Entities

Institutions

  • arXiv
