PREFINE: Preference-Based Safety Fine-Tuning for RL Policies

2026-05-22

Researchers have introduced PREFINE, a method for adding cost constraints to a pre-trained reinforcement learning policy using preference data. Standard RLHF compares responses to the same prompt; PREFINE instead uses trajectory-level preferences in continuous control settings. It adapts Direct Preference Optimization (DPO), commonly used to fine-tune large language models, to sequential decision-making. Starting from a reward-optimized policy and a small dataset of preferred (low-cost) and dispreferred (high-cost) trajectories, PREFINE fine-tunes the policy toward low-cost behavior while maintaining high reward. Because the policy is fine-tuned rather than retrained from scratch, the approach offers an efficient route to safety alignment.
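To make the idea of trajectory-level preference optimization concrete, here is a minimal sketch of a DPO-style loss for a single pair of trajectories. It is not taken from the paper, and PREFINE's exact objective may differ. The sketch assumes that a trajectory's log-probability is the sum of its per-step action log-probabilities, and that the frozen reward-optimized policy serves as the DPO reference. The function names and the beta value are hypothetical.

```python
import numpy as np


def trajectory_logprob(step_logprobs):
    """Sum per-step action log-probabilities into one trajectory log-probability.

    Assumption: the trajectory likelihood factorizes over the policy's actions.
    """
    return float(np.sum(step_logprobs))


def dpo_trajectory_loss(pol_pref, pol_disp, ref_pref, ref_disp, beta=0.1):
    """Standard DPO loss applied to one (preferred, dispreferred) trajectory pair.

    pol_*: trajectory log-probabilities under the policy being fine-tuned.
    ref_*: trajectory log-probabilities under the frozen reference policy
           (assumed here to be the original reward-optimized policy).
    beta:  hypothetical temperature controlling how far the policy may drift.
    """
    # Log-ratio of policy to reference for each trajectory.
    pref_ratio = pol_pref - ref_pref
    disp_ratio = pol_disp - ref_disp
    margin = beta * (pref_ratio - disp_ratio)
    # -log(sigmoid(margin)) written stably as log(1 + exp(-margin)).
    return float(np.logaddexp(0.0, -margin))


# Usage with made-up per-step log-probabilities for a low-cost and a high-cost trajectory.
ref_pref = trajectory_logprob([-1.0, -1.2, -0.9])
ref_disp = trajectory_logprob([-0.8, -1.1, -1.0])

# Before any fine-tuning the policy equals the reference, so the loss is log(2).
loss_at_init = dpo_trajectory_loss(ref_pref, ref_disp, ref_pref, ref_disp)
print(loss_at_init)
```

Minimizing this loss raises the likelihood of the preferred trajectory relative to the dispreferred one, while the reference terms keep the fine-tuned policy anchored to the reward-optimized starting point.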

Key facts

PREFINE stands for Preference-based Implicit Reward and Cost Fine-Tuning for Safety Alignment.
It addresses safety alignment in reinforcement learning by incorporating cost constraints.
Costs are provided as preferences rather than numerical values.
The method uses trajectory-level preferences in continuous control environments.
It adapts Direct Preference Optimization (DPO) from LLM fine-tuning to sequential decision-making.
The goal is to generate low-cost behaviors while maintaining high rewards.
The approach avoids retraining the policy from scratch.
The paper is available on arXiv under ID 2605.21225.
