Gradient Penalized Latent Dynamics Improves DreamerV3 Sample Efficiency
GPLD (Gradient-Penalized Latent Dynamics) is a regularizer for the DreamerV3 reinforcement learning framework that encourages the learned transition dynamics to be locally smooth. It applies a row-wise Jacobian penalty to the posterior latent distribution, which can be viewed as a continuous-latent analogue of the finite-difference smoothing used in discrete embedded-state MDPs. The penalty is estimated efficiently with Hutchinson-style stochastic probes rather than by forming the full Jacobian. On DeepMind Control proprioceptive tasks, GPLD improves aggregate sample efficiency, with the strongest gains on higher-complexity locomotion environments, including quadruped tasks.
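The Hutchinson-style estimate mentioned above can be illustrated with a minimal JAX sketch. It is not the paper's implementation: the function name `hutchinson_jacobian_penalty`, the toy map `toy_dynamics`, the choice of Rademacher probes, and the number of probes are all assumptions made for illustration. The summary also does not specify how the rows are weighted, so the sketch estimates the plain squared Frobenius norm of the Jacobian, which is the sum of its squared row norms.

```python
import jax
import jax.numpy as jnp


def hutchinson_jacobian_penalty(f, z, key, num_probes=8):
    """Stochastic estimate of ||J||_F^2, where J is the Jacobian of f at z.

    Hypothetical helper for illustration; not taken from the GPLD paper.
    """
    # One forward pass; vjp_fn maps an output-space vector v to v^T J.
    out, vjp_fn = jax.vjp(f, z)
    # Rademacher probes (entries +1 or -1) in output space, so E[v v^T] = I.
    probes = jax.random.rademacher(key, (num_probes,) + out.shape, dtype=out.dtype)
    # Each probe yields v^T J, a random signed combination of the Jacobian's rows.
    vjps = jax.vmap(lambda v: vjp_fn(v)[0])(probes)
    # E ||v^T J||^2 = trace(J J^T) = ||J||_F^2, so average the squared norms.
    return jnp.mean(jnp.sum(vjps.reshape(num_probes, -1) ** 2, axis=1))


# Toy check with a hypothetical linear "dynamics" map whose Jacobian is A.
A = jnp.diag(jnp.array([1.0, 2.0, 3.0]))
toy_dynamics = lambda z: A @ z
penalty = hutchinson_jacobian_penalty(
    toy_dynamics, jnp.ones(3), jax.random.PRNGKey(0)
)
```

For this diagonal `A` the rows are orthogonal, so every probe returns the exact value 1 + 4 + 9 = 14; for a general Jacobian the estimate is unbiased but noisy, and more probes reduce its variance. In a training loop, a term like this would presumably be added to the world-model loss with a small coefficient, which is an assumption here rather than a detail given in the summary.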
Key facts
- GPLD is a gradient-penalized latent dynamics regularizer for DreamerV3.
- It applies a row-wise Jacobian penalty to the posterior latent distribution.
- The penalty encourages locally smooth transition learning.
- It is estimated using Hutchinson-style stochastic probes.
- GPLD improves aggregate sample efficiency on DeepMind Control tasks.
- Strong gains are observed on higher-complexity locomotion environments.
- The method is tested on quadruped tasks.
- The paper is available on arXiv with ID 2605.23089.
Entities
Institutions
- arXiv