Dynamical Priors Improve Temporal Coherence in RL Training
A training framework named Dynamical Prior Reinforcement Learning (DP-RL) enhances policy gradient learning with an auxiliary loss derived from external state dynamics, which implements evidence accumulation and hysteresis. The technique shapes how action probabilities evolve over time without altering the reward, environment, or policy structure (a sketch of one possible formulation follows this paragraph). In three minimal environments, DP-RL systematically altered decision trajectories, encouraging temporally structured actions that generic smoothing cannot account for. The approach addresses the temporally incoherent behavior often seen in standard RL, such as sudden shifts in confidence, oscillations, or inactivity. The findings are detailed in arXiv preprint 2604.21464.
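The summary does not spell out the loss itself, so the following is a minimal sketch of one way such a prior could look, assuming a PyTorch setting: a leaky integrator accumulates recent action probabilities (evidence accumulation), and a softmax over the accumulated evidence yields a slowly moving target distribution (hysteresis) that the policy is pulled toward via a KL term. The function name `dynamical_prior_loss` and the parameters `tau` and `beta` are illustrative, not taken from the paper.

```python
import torch
import torch.nn.functional as F

def dynamical_prior_loss(logits, evidence, tau=0.9, beta=2.0):
    """Hypothetical auxiliary loss: pull policy outputs toward a leaky
    evidence accumulator, producing accumulation and hysteresis.

    logits:   (T, A) policy logits along a rolled-out trajectory
    evidence: (A,) accumulator state carried across calls
    tau:      leak rate; values near 1 mean slow, sticky updates
    beta:     sharpness of the accumulator-implied target distribution
    """
    probs = F.softmax(logits, dim=-1)
    targets = []
    for p in probs.detach():  # the target dynamics carry no gradient
        # Leaky integration: old evidence decays slowly and new evidence
        # accrues, so the target resists abrupt switches (hysteresis).
        evidence = tau * evidence + (1.0 - tau) * p
        targets.append(F.softmax(beta * evidence, dim=-1))
    targets = torch.stack(targets)
    # KL between policy and accumulator targets; the gradient flows only
    # through the policy's probabilities, not the external dynamics.
    aux = F.kl_div(probs.log(), targets, reduction="batchmean")
    return aux, evidence
```

Because `tau` retains most of the old evidence, the targets keep moving gradually even when the instantaneous policy jumps, which is what distinguishes this state-carrying penalty from a generic, stateless smoothing or entropy term.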
Key facts
- Standard RL optimizes policies for reward but imposes few constraints on how decisions evolve over time.
- Policies may achieve high performance while exhibiting temporally incoherent behavior.
- DP-RL introduces an auxiliary loss from external state dynamics.
- The framework implements evidence accumulation and hysteresis.
- No modifications are made to the reward, environment, or policy architecture; only the training loss gains an extra term (see the sketch after this list).
- Experiments were conducted across three minimal environments.
- Dynamical priors systematically alter decision trajectories in task-dependent ways.
- The results demonstrate that training objectives can shape temporal structure beyond generic smoothing.
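To make the "only the loss changes" point concrete, here is a hedged sketch of a single policy-gradient (REINFORCE) update with the prior attached, reusing the `dynamical_prior_loss` helper from the earlier sketch. The environment, reward, and policy network are whatever they would be without DP-RL; `lam` (the prior weight) and the tensor shapes are assumed for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

policy = nn.Sequential(nn.Linear(4, 32), nn.Tanh(), nn.Linear(32, 2))
opt = torch.optim.Adam(policy.parameters(), lr=1e-3)
lam = 0.1  # assumed weight on the dynamical-prior term

# Stand-ins for one rolled-out trajectory (environment code unchanged).
states = torch.randn(16, 4)
actions = torch.randint(0, 2, (16,))
returns = torch.randn(16)  # discounted returns; the reward is untouched

logits = policy(states)
log_pi = F.log_softmax(logits, dim=-1)[torch.arange(16), actions]
pg_loss = -(log_pi * returns).mean()  # standard REINFORCE objective

evidence = torch.full((2,), 0.5)  # uniform accumulator state at episode start
aux, evidence = dynamical_prior_loss(logits, evidence)

loss = pg_loss + lam * aux  # DP-RL: the same objective plus one extra term
opt.zero_grad()
loss.backward()
opt.step()
```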