Safe-Support Q-Learning Prevents Unsafe Exploration in RL
A new reinforcement learning framework called Safe-Support Q-Learning (SSQL) eliminates unsafe state visitation during training, addressing a critical challenge in real-world applications where hazardous exploration can cause catastrophic failures. Unlike conventional safe RL methods that merely mitigate risk through constraints or penalties while still permitting unsafe exploration, SSQL enforces a stricter safety requirement: every training trajectory must remain within a predefined safe set. The framework employs a behavior policy supported solely on this safe set, which provides sufficient exploration without needing to be near-optimal. Training follows a two-stage architecture in which the Q-function and policy are learned separately, with a KL-regularized Bellman target keeping the policy implied by the Q-function close to the behavior policy. The approach is detailed in a preprint on arXiv (2604.25379), highlighting its potential for safety-critical domains such as autonomous driving, robotics, and healthcare.
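The preprint's exact objective is not reproduced here; as a hedged illustration, a standard KL-regularized Bellman target toward a behavior policy admits the closed form below. The symbols (behavior policy $\beta$, temperature $\alpha$) and the log-sum-exp form are assumptions based on the common formulation, not the authors' notation.

```latex
% Sketch only: a standard KL-regularized Bellman target toward a
% behavior policy \beta with temperature \alpha. This is the closed
% form of max_\pi E_\pi[Q] - \alpha KL(\pi || \beta); notation is
% assumed, not taken from the preprint.
\mathcal{T}Q(s,a) = r(s,a) + \gamma\, \mathbb{E}_{s' \sim P(\cdot \mid s,a)}
\left[ \alpha \log \mathbb{E}_{a' \sim \beta(\cdot \mid s')}
\exp\!\big( Q(s',a') / \alpha \big) \right]
```

Because the target penalizes divergence from $\beta$, the policy implied by the learned Q-function inherits $\beta$'s support, which is the mechanism that keeps training inside the safe set.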
Key facts
- Safe-Support Q-Learning (SSQL) eliminates unsafe state visitation during RL training.
- It uses a behavior policy supported on a safe set to ensure trajectories remain safe.
- The framework adopts a two-stage training process for Q-function and policy.
- A KL-regularized Bellman target keeps the policy implied by the Q-function close to the behavior policy (see the sketch after this list).
- The behavior policy enables sufficient exploration without needing to be near-optimal.
- It addresses safety in real-world applications like autonomous driving and robotics.
- The preprint is available on arXiv with ID 2604.25379.
- SSQL is stricter than conventional safe RL methods that only mitigate risk.
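The preprint's algorithmic details are not reproduced here; the following is a minimal, self-contained sketch of the two-stage recipe described above, run on a toy tabular MDP. All names (`safe`, `beta`, `alpha`) and hyperparameters are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions, gamma, alpha = 6, 3, 0.95, 0.5

# Toy dynamics P(s' | s, a) and rewards R(s, a).
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))
R = rng.uniform(0.0, 1.0, size=(n_states, n_actions))

# Predefined safe set of (state, action) pairs. The behavior policy
# puts probability mass only on safe actions, so every rollout it
# generates stays inside the safe set by construction.
safe = rng.random((n_states, n_actions)) > 0.3
safe[np.arange(n_states), 0] = True            # at least one safe action per state
beta = safe / safe.sum(axis=1, keepdims=True)  # uniform over safe actions

# Stage 1: fit Q with a KL-regularized Bellman target toward beta.
# The soft value alpha * log E_{a'~beta} exp(Q/alpha) keeps the
# implied policy close to the behavior policy.
Q = np.zeros((n_states, n_actions))
for _ in range(500):
    soft_v = alpha * np.log((beta * np.exp(Q / alpha)).sum(axis=1))
    Q = R + gamma * P @ soft_v

# Stage 2: extract the policy separately, again supported only on
# the safe set (unsafe actions get zero probability).
logits = np.where(safe, Q / alpha, -np.inf)
pi = np.exp(logits - logits.max(axis=1, keepdims=True))
pi /= pi.sum(axis=1, keepdims=True)

print("learned policy (rows = states):")
print(np.round(pi, 3))
```

Because `pi` assigns zero probability to actions outside `safe`, rollouts under either `beta` during training or `pi` at deployment never leave the safe set, matching the support restriction the framework's name refers to.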