Posterior Sampling Boosts Offline RL Generalization
A new paper on arXiv (2605.07393) introduces Posterior Sampling-based Policy Optimization (PSPO) for model-based offline reinforcement learning. PSPO addresses the trade-off between generalization and robustness by formulating dynamics modeling as Bayesian inference, yielding a posterior that quantifies model fidelity. It then uses posterior sampling together with constrained policy optimization to leverage dynamics-consistent out-of-distribution (OOD) transitions for generalization while preventing their exploitation. The approach aims to avoid the excessively pessimistic regularization common in existing offline RL methods.
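The paper's exact algorithm is not reproduced in this summary, but the core idea of posterior sampling over learned dynamics can be sketched with a bootstrapped ensemble standing in for the posterior: each imagined rollout draws one plausible dynamics model, so model uncertainty shows up directly in the synthetic transitions. This is a minimal illustration under that assumption; all names below (`LinearDynamicsModel`, `fit_posterior_ensemble`, `sample_rollout`) are hypothetical, not from the paper.

```python
# Minimal sketch (not the paper's implementation) of posterior sampling over dynamics:
# a bootstrapped ensemble approximates the posterior, and each synthetic rollout
# samples one ensemble member, so imagined transitions reflect model uncertainty.
import numpy as np

rng = np.random.default_rng(0)

class LinearDynamicsModel:
    """Least-squares fit of s' = W [s; a; 1] on a bootstrap resample of the data."""
    def __init__(self, states, actions, next_states):
        X = np.hstack([states, actions, np.ones((len(states), 1))])
        self.W, *_ = np.linalg.lstsq(X, next_states, rcond=None)

    def predict(self, s, a):
        x = np.concatenate([s, a, [1.0]])
        return x @ self.W

def fit_posterior_ensemble(states, actions, next_states, n_models=5):
    """Approximate posterior over dynamics via bootstrap resampling of the offline data."""
    n = len(states)
    models = []
    for _ in range(n_models):
        idx = rng.integers(0, n, size=n)  # bootstrap resample
        models.append(LinearDynamicsModel(states[idx], actions[idx], next_states[idx]))
    return models

def sample_rollout(posterior, policy, s0, horizon=10):
    """Posterior sampling: draw ONE dynamics model and use it for the whole imagined trajectory."""
    model = posterior[rng.integers(len(posterior))]
    traj, s = [], s0
    for _ in range(horizon):
        a = policy(s)
        s_next = model.predict(s, a)
        traj.append((s, a, s_next))
        s = s_next
    return traj

# Toy offline dataset and a trivial policy, just to show the plumbing end to end.
S = rng.normal(size=(256, 3))
A = rng.normal(size=(256, 2))
S_next = S + 0.1 * A @ rng.normal(size=(2, 3))
posterior = fit_posterior_ensemble(S, A, S_next)
rollout = sample_rollout(posterior, policy=lambda s: -0.5 * s[:2], s0=S[0])
print(len(rollout), rollout[-1][2])
```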
Key facts
- Paper available on arXiv with ID 2605.07393
- Proposes PSPO (Posterior Sampling-based Policy Optimization)
- Addresses the generalization vs. robustness trade-off in offline RL
- Uses Bayesian inference for dynamics modeling
- Employs posterior sampling and constrained policy optimization
- Leverages dynamics-consistent OOD transitions while guarding against their exploitation (see the sketch after this list)
- Aims to avoid excessively pessimistic regularization
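One hedged reading of the constrained-optimization side, continuing the sketch above: imagined OOD transitions are kept only where the posterior ensemble agrees (a rough proxy for "dynamics-consistent"), and a Lagrangian penalty on average disagreement constrains the policy objective. This is in the spirit of uncertainty-penalized offline model-based RL, not necessarily PSPO's actual constraint; the threshold and budget values are placeholders.

```python
# Hedged sketch of a "constrained" objective, reusing `posterior` and `rollout`
# from the sketch above. Keep transitions the ensemble agrees on; penalize the
# policy objective when mean disagreement exceeds a budget. Illustrative only.
import numpy as np

def disagreement(posterior, s, a):
    """Std-dev of ensemble predictions; high values flag unreliable model regions."""
    preds = np.stack([m.predict(s, a) for m in posterior])
    return float(preds.std(axis=0).mean())

def filter_consistent(posterior, transitions, threshold=0.05):
    """Keep only imagined transitions on which the posterior members agree."""
    return [(s, a, s2) for (s, a, s2) in transitions
            if disagreement(posterior, s, a) <= threshold]

def constrained_objective(returns, disagreements, lagrange_lambda, budget=0.05):
    """Policy objective with a Lagrangian penalty on mean model disagreement."""
    return np.mean(returns) - lagrange_lambda * max(0.0, np.mean(disagreements) - budget)

# Usage with the earlier toy rollout (returns are dummies here).
consistent = filter_consistent(posterior, rollout)
if consistent:
    obj = constrained_objective(
        returns=[1.0] * len(consistent),
        disagreements=[disagreement(posterior, s, a) for s, a, _ in consistent],
        lagrange_lambda=1.0,
    )
    print(len(consistent), obj)
```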
Entities
Institutions
- arXiv