Crowd Preferences Reveal Shared Safety Criteria for RL
A recent arXiv preprint (2605.21822) presents Safe Crowd Preference-based RL (SCP-RL), a hierarchical framework that extracts shared safety criteria from crowd preference data. The researchers highlight the limitations of direct reward combination, that is, optimizing a preference-based reward model together with the downstream task reward. In contrast, SCP-RL extracts safety-aligned skills from crowd preferences and composes them through a high-level policy to solve downstream tasks safely. The method is validated by experiments in safe RL environments and an initial LLM-style task. The study emphasizes that crowd preferences contain shared safety principles: users may pursue different goals, yet they often follow similar safety criteria.
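The contrast between the two designs can be illustrated with a toy sketch. This is not the paper's implementation: the 1-D state, the `SAFE_LIMIT` constant, the `advance` and `hold` skills, the weighted-sum form of `direct_combination`, and the hand-written `high_level_policy` rule are all hypothetical stand-ins chosen for illustration. In the sketch, direct combination folds safety into a single scalar reward, whereas the hierarchical variant lets the high-level policy choose only among skills that already respect the safety limit.

```python
from typing import Callable, Dict

SAFE_LIMIT = 2.0  # hypothetical boundary that every skill respects


def direct_combination(task_r: float, pref_r: float, weight: float = 0.5) -> float:
    """Direct reward combination, assumed here to be a weighted sum."""
    return task_r + weight * pref_r


def advance(x: float) -> float:
    """Safety-aligned skill: step forward by up to 0.5, never past SAFE_LIMIT."""
    return min(0.5, SAFE_LIMIT - x)


def hold(x: float) -> float:
    """Safety-aligned skill: stay in place."""
    return 0.0


# Library of low-level skills (hand-written here; learned from preferences in the paper).
SKILLS: Dict[str, Callable[[float], float]] = {"advance": advance, "hold": hold}


def high_level_policy(x: float, goal: float) -> str:
    """Pick a skill by name; a fixed rule standing in for a learned policy."""
    return "advance" if x < goal else "hold"


def rollout(x: float, goal: float, steps: int) -> float:
    """Run the hierarchy: the high-level policy selects a skill, the skill acts."""
    for _ in range(steps):
        skill = SKILLS[high_level_policy(x, goal)]
        x += skill(x)
    return x


if __name__ == "__main__":
    # The goal lies beyond the safe limit, so the agent stops at the limit.
    print(rollout(0.0, goal=3.0, steps=10))
    # The goal lies inside the safe region, so the agent reaches it and holds.
    print(rollout(0.0, goal=1.0, steps=10))
```

Because the high-level policy in this sketch can only select from `SKILLS`, a task goal placed beyond `SAFE_LIMIT` cannot pull the agent past it. With `direct_combination`, the same outcome would depend on how `weight` trades off the two reward terms.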
Key facts
- arXiv paper 2605.21822 proposes Safe Crowd Preference-based RL (SCP-RL)
- SCP-RL extracts shared safety criteria from crowd preference datasets
- Direct reward combination has inherent limitations for safety alignment
- Hierarchical framework extracts safety-aligned skills from crowd preferences
- Skills are composed via a high-level policy for downstream tasks
- Experiments were conducted in safe RL environments and an initial LLM-style task
- Crowd preferences contain common safety principles despite diverse user objectives
- Method transfers safety criteria from crowd data to downstream RL tasks
Entities
Institutions
- arXiv