Safe-Support Q-Learning Prevents Unsafe Exploration in RL
A new reinforcement learning framework called Safe-Support Q-Learning (SSQL) eliminates unsafe state visitation during training, addressing a critical challenge in real-world applications where hazardous exploration can cause catastrophic failures. Unlike conventional safe RL methods that merely mitigate risk through constraints or penalties while still permitting unsafe exploration, SSQL enforces a stricter safety requirement: every training trajectory must remain within a predefined safe set. The framework employs a behavior policy supported solely on this safe set, which provides sufficient exploration without needing to be near-optimal. Training follows a two-stage architecture in which the Q-function and policy are learned separately, with a KL-regularized Bellman target keeping the policy implied by the Q-function close to the behavior policy. The approach is detailed in a preprint on arXiv (2604.25379), highlighting its potential for safety-critical domains such as autonomous driving, robotics, and healthcare.
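The preprint's exact objective is not reproduced here; as a hedged illustration, a standard KL-regularized Bellman target toward a behavior policy admits the closed form below. The symbols (behavior policy $\beta$, temperature $\alpha$) and the log-sum-exp form are assumptions based on the common formulation, not the authors' notation.

```latex
% Sketch only: a standard KL-regularized Bellman target toward a
% behavior policy \beta with temperature \alpha. This is the closed
% form of max_\pi E_\pi[Q] - \alpha KL(\pi || \beta); notation is
% assumed, not taken from the preprint.
\mathcal{T}Q(s,a) = r(s,a) + \gamma\, \mathbb{E}_{s' \sim P(\cdot \mid s,a)}
\left[ \alpha \log \mathbb{E}_{a' \sim \beta(\cdot \mid s')}
\exp\!\big( Q(s',a') / \alpha \big) \right]
```

Because the target penalizes divergence from $\beta$, the policy implied by the learned Q-function inherits $\beta$'s support, which is the mechanism that keeps training inside the safe set.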
Key facts
- Safe-Support Q-Learning (SSQL) eliminates unsafe state visitation during RL training.
- It uses a behavior policy supported on a safe set to ensure trajectories remain safe.
- The framework adopts a two-stage training process for Q-function and policy.
- A KL-regularized Bellman target keeps the policy implied by the Q-function close to the behavior policy (see the sketch after this list).
- The behavior policy enables sufficient exploration without needing to be near-optimal.
- It addresses safety in real-world applications like autonomous driving and robotics.
- The preprint is available on arXiv with ID 2604.25379.
- SSQL is stricter than conventional safe RL methods that only mitigate risk.
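The preprint's algorithmic details are not reproduced here; the following is a minimal, self-contained sketch of the two-stage recipe described above, run on a toy tabular MDP. All names (`safe`, `beta`, `alpha`) and hyperparameters are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions, gamma, alpha = 6, 3, 0.95, 0.5

# Toy dynamics P(s' | s, a) and rewards R(s, a).
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))
R = rng.uniform(0.0, 1.0, size=(n_states, n_actions))

# Predefined safe set of (state, action) pairs. The behavior policy
# puts probability mass only on safe actions, so every rollout it
# generates stays inside the safe set by construction.
safe = rng.random((n_states, n_actions)) > 0.3
safe[np.arange(n_states), 0] = True            # at least one safe action per state
beta = safe / safe.sum(axis=1, keepdims=True)  # uniform over safe actions

# Stage 1: fit Q with a KL-regularized Bellman target toward beta.
# The soft value alpha * log E_{a'~beta} exp(Q/alpha) keeps the
# implied policy close to the behavior policy.
Q = np.zeros((n_states, n_actions))
for _ in range(500):
    soft_v = alpha * np.log((beta * np.exp(Q / alpha)).sum(axis=1))
    Q = R + gamma * P @ soft_v

# Stage 2: extract the policy separately, again supported only on
# the safe set (unsafe actions get zero probability).
logits = np.where(safe, Q / alpha, -np.inf)
pi = np.exp(logits - logits.max(axis=1, keepdims=True))
pi /= pi.sum(axis=1, keepdims=True)

print("learned policy (rows = states):")
print(np.round(pi, 3))
```

Because `pi` assigns zero probability to actions outside `safe`, rollouts under either `beta` during training or `pi` at deployment never leave the safe set, matching the support restriction the framework's name refers to.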