ARTFEED — Contemporary Art Intelligence

Second-Order Actor-Critic Methods for Discounted MDPs via Policy Hessian Decomposition

other · 2026-05-16

The arXiv paper (2605.14982) studies the discounted reward setting in reinforcement learning (RL). Actor-critic methods, which mitigate the value-approximation challenges of policy gradient methods, typically use first-order updates and converge to stationary points under suitable assumptions. Second-order optimization offers curvature-aware updates that can accelerate convergence, but its use in RL is limited by the computational cost of estimating the Hessian. The authors analyze second-order approximations of the actor update that exploit the full curvature information of the objective. They show that a stable approximation requires treating the action-value function as locally constant with respect to the policy parameters, a condition that does not generally hold in policy gradient methods. It does, however, become well justified in a two-timescale framework, where the critic is updated on a faster timescale and therefore appears quasi-static to the actor.
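
For context, the role of the locally-constant-Q condition can be seen in a generic policy Hessian decomposition (standard policy gradient notation; this is a schematic sketch, not the paper's own derivation). Starting from the policy gradient and differentiating the integrand a second time, schematically:

    \nabla_\theta J(\theta)
        = \mathbb{E}_{s,a}\!\left[ \nabla_\theta \log \pi_\theta(a \mid s) \, Q^{\pi_\theta}(s,a) \right]

    \nabla_\theta^2 J(\theta)
        \approx \mathbb{E}_{s,a}\!\Big[ \big( \nabla_\theta^2 \log \pi_\theta(a \mid s)
            + \nabla_\theta \log \pi_\theta(a \mid s) \, \nabla_\theta \log \pi_\theta(a \mid s)^{\top} \big)
            \, Q^{\pi_\theta}(s,a) \Big]
        + \text{terms involving } \nabla_\theta Q^{\pi_\theta}(s,a)

Treating Q^{\pi_\theta} as locally constant drops the trailing \nabla_\theta Q terms, leaving a curvature estimate that can be formed from the policy's score function and the critic's value estimate alone. The two-timescale argument is what licenses the drop: a critic updated on the faster timescale has effectively equilibrated between successive actor steps.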

Key facts

  • Paper addresses discounted reward setting in RL
  • Actor-critic methods mitigate value approximation challenges in policy gradient methods
  • First-order actor-critic methods converge to stationary points under suitable assumptions
  • Second-order optimization provides curvature-aware updates that can accelerate convergence
  • Application of second-order methods in RL is limited by computational complexity of Hessian estimation
  • Authors analyze second-order approximations for the actor update using full curvature information
  • Stable approximation requires treating action-value function as locally constant with respect to policy parameters
  • Approximation becomes well justified under a two-timescale framework (a minimal sketch of such an update follows this list)
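
Below is a minimal sketch of a two-timescale actor-critic loop with a curvature-aware actor update of the kind described above. Everything here is an illustrative assumption, not the paper's implementation: the random tabular MDP, the softmax policy, the expected-SARSA critic target, the damping constant, and all step sizes are invented for the example. The curvature term uses only derivatives of log pi, i.e. the critic's Q estimate is treated as locally constant.

    import numpy as np

    # A small random MDP (illustrative only; not from the paper).
    rng = np.random.default_rng(0)
    nS, nA, gamma = 5, 3, 0.95
    P = rng.dirichlet(np.ones(nS), size=(nS, nA))   # P[s, a] -> next-state distribution
    R = rng.uniform(0.0, 1.0, size=(nS, nA))        # reward table

    theta = np.zeros((nS, nA))                      # softmax policy parameters
    Q = np.zeros((nS, nA))                          # critic's action-value estimate

    def policy(s):
        z = np.exp(theta[s] - theta[s].max())
        return z / z.sum()

    alpha_critic, alpha_actor, damping = 0.5, 0.05, 1.0  # critic on the faster timescale
    s = 0
    for t in range(20000):
        pi = policy(s)
        a = rng.choice(nA, p=pi)
        s_next = rng.choice(nS, p=P[s, a])
        r = R[s, a]

        # Fast timescale: TD(0) critic update toward an expected-SARSA target.
        Q[s, a] += alpha_critic * (r + gamma * policy(s_next) @ Q[s_next] - Q[s, a])

        # Slow timescale: curvature-aware actor update for theta[s].
        # Q is treated as locally constant, so curvature uses only derivatives of log pi.
        g_log = np.eye(nA)[a] - pi                   # grad of log pi(a|s) wrt theta[s]
        grad = g_log * Q[s, a]                       # score-function gradient estimate
        H_log = -(np.diag(pi) - np.outer(pi, pi))    # Hessian of log pi wrt theta[s]
        H = (H_log + np.outer(g_log, g_log)) * Q[s, a]  # locally-constant-Q curvature
        # Damped Newton ascent; damping keeps the linear system well conditioned.
        step = np.linalg.solve(damping * np.eye(nA) - H, grad)
        theta[s] += alpha_actor * step

        s = s_next

    print("greedy action per state:", theta.argmax(axis=1))

The damped solve stands in for the regularization any practical second-order step needs: a per-sample curvature estimate is noisy and generally indefinite, so the raw Newton direction would be unusable without it.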

Entities

Institutions

  • arXiv
