Policy Gradient Failure Modes in Long-Horizon Cumulative-Damage Problems

publication · 2026-05-27

A recent arXiv paper (2605.26657) identifies two distinct failure modes for policy-gradient methods in long-horizon decision problems with cumulative damage: failures of completion (reaching the end of the horizon) and failures of optimality (matching the dynamic-programming solution). Under PPO with a linear soft penalty, horizon access on its own reduces the completion rate, because the penalty equilibrium drives the dominant-activity share to zero. Restricting the action space while keeping horizon access achieves completion but leaves an optimality gap (ΔM_final = 0.271), which the paper traces to first-phase greedy commitment at the damage origin. The authors derive four testable predictions and evaluate them in two separately calibrated environments that share the same abstract structure.

Key facts

Paper arXiv:2605.26657 identifies completion and optimality as two failure modes for policy-gradient methods in cumulative-damage problems.
Under PPO with a linear soft penalty, horizon access alone reduces the completion rate.
Action-space restriction with horizon access achieves completion but leaves an optimality gap of ΔM_final = 0.271.
The optimality gap is traced to first-phase greedy commitment at the damage origin.
Four testable predictions are derived and evaluated in two calibrated environments.
The environments share the same abstract structure but are separately calibrated.
Cumulative-damage problems couple locally attractive actions to globally adverse outcomes.
Paper proposes a decomposition separating completion and optimality.
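
The separation between completion and optimality can be illustrated with a small toy problem. The sketch below is not the paper's environment or its PPO setup, and it does not reproduce the 0.271 figure: the action table, horizon, damage budget, and all function names are hypothetical, chosen only to show how a locally attractive action can end an episode early, and how a policy restricted to feasible actions can finish the horizon yet still fall short of the dynamic-programming value because it spends its damage budget greedily at the start.

```python
from functools import lru_cache

# Hypothetical toy problem (illustrative numbers, not taken from the paper).
# Each action maps to (reward, damage added).
ACTIONS = {"rest": (0, 0), "low": (12, 1), "high": (20, 2)}
HORIZON = 8          # number of decision steps
DAMAGE_BUDGET = 10   # episode fails once cumulative damage exceeds this


def rollout(policy):
    """Run one deterministic episode; return (completed, total_reward)."""
    damage, total = 0, 0
    for t in range(HORIZON):
        reward, cost = ACTIONS[policy(t, damage)]
        damage += cost
        if damage > DAMAGE_BUDGET:
            return False, total  # failed before reaching the horizon
        total += reward
    return True, total


def greedy(t, damage):
    """Always take the locally most rewarding action."""
    return "high"


def restricted_greedy(t, damage):
    """Greedy over the actions that keep damage within the budget."""
    feasible = [a for a, (_, c) in ACTIONS.items() if damage + c <= DAMAGE_BUDGET]
    return max(feasible, key=lambda a: ACTIONS[a][0])


@lru_cache(maxsize=None)
def dp_value(t, damage):
    """Best achievable return from (t, damage) among completing policies."""
    if t == HORIZON:
        return 0
    # "rest" is always feasible, so the generator is never empty.
    return max(
        r + dp_value(t + 1, damage + c)
        for r, c in ACTIONS.values()
        if damage + c <= DAMAGE_BUDGET
    )


def decompose(policy):
    """Return (completed, optimality_gap); the gap is None if the episode failed."""
    completed, total = rollout(policy)
    gap = dp_value(0, 0) - total if completed else None
    return completed, gap


print(decompose(greedy))             # (False, None)
print(decompose(restricted_greedy))  # (True, 12)
```

In this toy setting the unrestricted greedy policy fails completion at the sixth step, while the restricted variant completes with a return of 100 against a dynamic-programming value of 112. Reporting the two quantities separately keeps a completion failure from being mistaken for a large optimality gap, and the reverse.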
