Closed-Form Policy Steering for Frozen Offline RL Actors
A recent arXiv preprint (2604.22873) presents a closed-form strategy for steering frozen offline reinforcement learning (RL) policies at deployment time without retraining. The technique composes the frozen actor with a goal-conditioned prior via a Product-of-Experts (PoE). A key finding is that precision-weighted composition degrades gracefully even under degraded or random priors, remaining anchored to the frozen actor, whereas additive and prior-only adaptation collapse. A KL-budget selector typically recovers an operating point close to an oracle choice. For diagonal-Gaussian actors and priors, PoE with mixing weight alpha yields the same deterministic policy as KL-regularized adaptation with beta = alpha / (1 - alpha). The work targets settings where retraining is infeasible because of data, cost, or governance constraints.
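The alpha/beta identity can be checked directly in the diagonal-Gaussian case. The sketch below is not the paper's code; it assumes the PoE is the tempered product pi_actor^(1-alpha) * p_prior^alpha and that KL-regularized adaptation picks the maximizer of log pi_actor(a) + beta * log p_prior(a). Under those assumptions, both deterministic (mean) actions are precision-weighted averages and coincide when beta = alpha / (1 - alpha).

```python
# Minimal sketch (not the paper's code) verifying the closed-form identity for
# diagonal Gaussians: PoE with mixing weight alpha gives the same deterministic
# action as KL-regularized adaptation with beta = alpha / (1 - alpha).
# Variable names (mu_actor, var_actor, mu_prior, var_prior) are illustrative.
import numpy as np

def poe_mean(mu_actor, var_actor, mu_prior, var_prior, alpha):
    """Mean of the tempered product pi_actor^(1-alpha) * p_prior^alpha for
    diagonal Gaussians: a precision-weighted average of the two means."""
    w_actor = (1.0 - alpha) / var_actor
    w_prior = alpha / var_prior
    return (w_actor * mu_actor + w_prior * mu_prior) / (w_actor + w_prior)

def kl_regularized_mean(mu_actor, var_actor, mu_prior, var_prior, beta):
    """Maximizer of log pi_actor(a) + beta * log p_prior(a) for diagonal
    Gaussians: the prior's precision is scaled by beta."""
    w_actor = 1.0 / var_actor
    w_prior = beta / var_prior
    return (w_actor * mu_actor + w_prior * mu_prior) / (w_actor + w_prior)

rng = np.random.default_rng(0)
mu_a, var_a = rng.normal(size=4), rng.uniform(0.1, 1.0, size=4)
mu_p, var_p = rng.normal(size=4), rng.uniform(0.1, 1.0, size=4)

for alpha in (0.1, 0.5, 0.9):
    beta = alpha / (1.0 - alpha)
    assert np.allclose(poe_mean(mu_a, var_a, mu_p, var_p, alpha),
                       kl_regularized_mean(mu_a, var_a, mu_p, var_p, beta))
print("PoE(alpha) matches KL-regularized adaptation with beta = alpha/(1-alpha).")
```

The mapping beta = alpha / (1 - alpha) sends alpha -> 0 to beta -> 0 (pure frozen actor) and alpha -> 1 to beta -> infinity (pure prior), which is why the two parameterizations trace out the same family of steered policies.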
Key facts
- arXiv paper 2604.22873
- Offline RL policy adaptation without retraining
- Product-of-Experts composition with goal-conditioned prior
- Precision-weighted composition shows graceful degradation
- Additive and prior-only adaptation collapse under degraded priors
- KL-budget selector recovers a near-oracle operating point (see the sketch after this list)
- Closed-form identity: PoE with mixing weight alpha equals KL-regularized adaptation with beta = alpha/(1-alpha)
- Frozen-actor setting with diagonal-Gaussian actors and priors
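For illustration, a KL-budget selector can be sketched in the diagonal-Gaussian case: pick the largest mixing weight alpha whose composed policy stays within a fixed KL budget of the frozen actor. The bisection procedure and function names below are assumptions for illustration, not the paper's implementation, and rely on the KL to the actor growing monotonically with alpha over the search interval.

```python
# Illustrative KL-budget selector sketch (not the paper's selector): choose the
# largest alpha whose PoE-composed policy stays within kl_budget nats of the
# frozen actor. Assumes diagonal Gaussians and KL monotone in alpha.
import numpy as np

def poe_gaussian(mu_a, var_a, mu_p, var_p, alpha):
    """Tempered product pi_actor^(1-alpha) * p_prior^alpha of diagonal Gaussians."""
    prec = (1.0 - alpha) / var_a + alpha / var_p
    mu = ((1.0 - alpha) / var_a * mu_a + alpha / var_p * mu_p) / prec
    return mu, 1.0 / prec

def kl_diag_gauss(mu1, var1, mu2, var2):
    """KL( N(mu1, var1) || N(mu2, var2) ) for diagonal Gaussians."""
    return 0.5 * np.sum(var1 / var2 + (mu2 - mu1) ** 2 / var2
                        - 1.0 + np.log(var2 / var1))

def select_alpha(mu_a, var_a, mu_p, var_p, kl_budget, iters=50):
    """Bisect for the largest alpha with KL(composed || frozen actor) <= budget."""
    lo, hi = 0.0, 1.0 - 1e-6
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        mu_c, var_c = poe_gaussian(mu_a, var_a, mu_p, var_p, mid)
        if kl_diag_gauss(mu_c, var_c, mu_a, var_a) <= kl_budget:
            lo = mid  # within budget: push further toward the prior
        else:
            hi = mid  # over budget: pull back toward the frozen actor
    return lo

# Example: steer a 3-D action toward a goal-conditioned prior under a 0.1-nat budget.
mu_a, var_a = np.zeros(3), np.full(3, 0.2)
mu_p, var_p = np.ones(3), np.full(3, 0.5)
alpha = select_alpha(mu_a, var_a, mu_p, var_p, kl_budget=0.1)
print(f"selected alpha = {alpha:.3f}")
```

The budget caps how far the deployed policy can drift from the frozen actor, which is the property that keeps precision-weighted steering stable even when the goal-conditioned prior is degraded.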
Entities
Institutions
- arXiv