Closed-Form Policy Steering for Frozen Offline RL Actors
A recent arXiv preprint (2604.22873) presents a closed-form strategy for steering frozen offline reinforcement learning (RL) policies at deployment time without retraining. The technique composes the frozen actor with a goal-conditioned prior via a Product-of-Experts (PoE). A key finding is that precision-weighted composition degrades gracefully even under degraded or random priors, remaining anchored to the frozen actor, whereas additive and prior-only adaptation collapse. A KL-budget selector typically recovers an operating point close to an oracle choice. For diagonal-Gaussian actors and priors, PoE with mixing weight alpha yields the same deterministic policy as KL-regularized adaptation with beta = alpha / (1 - alpha). The work targets settings where retraining is infeasible because of data, cost, or governance constraints.
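The alpha/beta identity can be checked directly in the diagonal-Gaussian case. The sketch below is not the paper's code; it assumes the PoE is the tempered product pi_actor^(1-alpha) * p_prior^alpha and that KL-regularized adaptation picks the maximizer of log pi_actor(a) + beta * log p_prior(a). Under those assumptions, both deterministic (mean) actions are precision-weighted averages and coincide when beta = alpha / (1 - alpha).

```python
# Minimal sketch (not the paper's code) verifying the closed-form identity for
# diagonal Gaussians: PoE with mixing weight alpha gives the same deterministic
# action as KL-regularized adaptation with beta = alpha / (1 - alpha).
# Variable names (mu_actor, var_actor, mu_prior, var_prior) are illustrative.
import numpy as np

def poe_mean(mu_actor, var_actor, mu_prior, var_prior, alpha):
    """Mean of the tempered product pi_actor^(1-alpha) * p_prior^alpha for
    diagonal Gaussians: a precision-weighted average of the two means."""
    w_actor = (1.0 - alpha) / var_actor
    w_prior = alpha / var_prior
    return (w_actor * mu_actor + w_prior * mu_prior) / (w_actor + w_prior)

def kl_regularized_mean(mu_actor, var_actor, mu_prior, var_prior, beta):
    """Maximizer of log pi_actor(a) + beta * log p_prior(a) for diagonal
    Gaussians: the prior's precision is scaled by beta."""
    w_actor = 1.0 / var_actor
    w_prior = beta / var_prior
    return (w_actor * mu_actor + w_prior * mu_prior) / (w_actor + w_prior)

rng = np.random.default_rng(0)
mu_a, var_a = rng.normal(size=4), rng.uniform(0.1, 1.0, size=4)
mu_p, var_p = rng.normal(size=4), rng.uniform(0.1, 1.0, size=4)

for alpha in (0.1, 0.5, 0.9):
    beta = alpha / (1.0 - alpha)
    assert np.allclose(poe_mean(mu_a, var_a, mu_p, var_p, alpha),
                       kl_regularized_mean(mu_a, var_a, mu_p, var_p, beta))
print("PoE(alpha) matches KL-regularized adaptation with beta = alpha/(1-alpha).")
```

The mapping beta = alpha / (1 - alpha) sends alpha -> 0 to beta -> 0 (pure frozen actor) and alpha -> 1 to beta -> infinity (pure prior), which is why the two parameterizations trace out the same family of steered policies.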
Key facts
- arXiv paper 2604.22873
- Offline RL policy adaptation without retraining
- Product-of-Experts composition with goal-conditioned prior
- Precision-weighted composition shows graceful degradation
- Additive and prior-only adaptation collapse under degraded priors
- KL-budget selector recovers a near-oracle operating point (see the sketch after this list)
- Closed-form identity: PoE with mixing weight alpha equals KL-regularized adaptation with beta = alpha/(1-alpha)
- Frozen-actor setting with diagonal-Gaussian actors and priors
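For illustration, a KL-budget selector can be sketched in the diagonal-Gaussian case: pick the largest mixing weight alpha whose composed policy stays within a fixed KL budget of the frozen actor. The bisection procedure and function names below are assumptions for illustration, not the paper's implementation, and rely on the KL to the actor growing monotonically with alpha over the search interval.

```python
# Illustrative KL-budget selector sketch (not the paper's selector): choose the
# largest alpha whose PoE-composed policy stays within kl_budget nats of the
# frozen actor. Assumes diagonal Gaussians and KL monotone in alpha.
import numpy as np

def poe_gaussian(mu_a, var_a, mu_p, var_p, alpha):
    """Tempered product pi_actor^(1-alpha) * p_prior^alpha of diagonal Gaussians."""
    prec = (1.0 - alpha) / var_a + alpha / var_p
    mu = ((1.0 - alpha) / var_a * mu_a + alpha / var_p * mu_p) / prec
    return mu, 1.0 / prec

def kl_diag_gauss(mu1, var1, mu2, var2):
    """KL( N(mu1, var1) || N(mu2, var2) ) for diagonal Gaussians."""
    return 0.5 * np.sum(var1 / var2 + (mu2 - mu1) ** 2 / var2
                        - 1.0 + np.log(var2 / var1))

def select_alpha(mu_a, var_a, mu_p, var_p, kl_budget, iters=50):
    """Bisect for the largest alpha with KL(composed || frozen actor) <= budget."""
    lo, hi = 0.0, 1.0 - 1e-6
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        mu_c, var_c = poe_gaussian(mu_a, var_a, mu_p, var_p, mid)
        if kl_diag_gauss(mu_c, var_c, mu_a, var_a) <= kl_budget:
            lo = mid  # within budget: push further toward the prior
        else:
            hi = mid  # over budget: pull back toward the frozen actor
    return lo

# Example: steer a 3-D action toward a goal-conditioned prior under a 0.1-nat budget.
mu_a, var_a = np.zeros(3), np.full(3, 0.2)
mu_p, var_p = np.ones(3), np.full(3, 0.5)
alpha = select_alpha(mu_a, var_a, mu_p, var_p, kl_budget=0.1)
print(f"selected alpha = {alpha:.3f}")
```

The budget caps how far the deployed policy can drift from the frozen actor, which is the property that keeps precision-weighted steering stable even when the goal-conditioned prior is degraded.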
Entities
Institutions
- arXiv