SPAR: Support-Preserving Action Rectification for Offline Policy Improvement
Researchers have introduced Support-Preserving Action Rectification (SPAR), a framework for offline policy improvement that targets the tension between maximizing value and staying aligned with the data distribution. SPAR recasts global policy learning as local residual rectification anchored to a frozen behavior cloning policy. Working in the residual space allows fine-grained fitting and local policy improvement while narrowing the search space. The framework also includes Latent Self-Imitation, a mechanism that uses latent-sampling weighted regression to address the conflict between fitting and improvement gradients. According to the authors, this mechanism is theoretically grounded and resolves the fitting-optimization dilemma. The paper is available on arXiv under the identifier 2605.27877.
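The residual idea can be pictured with a small sketch. It is an illustration under assumptions, not the paper's implementation: the names `bc_policy`, `residual_policy`, and `max_residual`, the tanh bound on the correction, and the one-dimensional action space are all hypothetical choices made here for clarity.

```python
import math

# Assumed action bounds for a 1-D toy action space (not from the paper).
ACTION_LOW, ACTION_HIGH = -1.0, 1.0


def bc_policy(state):
    """Stand-in for a frozen behavior cloning policy (hypothetical)."""
    return 0.5 * math.tanh(state)


def residual_policy(state, theta):
    """Learnable correction squashed into [-1, 1] (hypothetical form)."""
    return math.tanh(theta * state)


def rectified_action(state, theta, max_residual=0.1):
    """Final action = frozen base action + small bounded residual."""
    base = bc_policy(state)  # never updated during improvement
    delta = max_residual * residual_policy(state, theta)
    # Clip so the corrected action stays inside the valid action range.
    return min(ACTION_HIGH, max(ACTION_LOW, base + delta))
```

Because the correction is bounded by `max_residual`, the improved policy can only move a short distance from the behavior cloning action, which is one simple way to keep the search local.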
Key facts
- SPAR stands for Support-Preserving Action Rectification.
- It addresses the conflict in offline policy improvement between maximizing value and fitting the data distribution.
- In-sample weighted regression suffers from over-conservatism, which suppresses high-value actions.
- Gradient-based approaches exhibit a conflict between fitting and optimization gradients.
- SPAR reframes global learning as local residual rectification anchored to a frozen behavior cloning policy.
- It performs fine-grained fitting and local policy improvement in residual space.
- Latent Self-Imitation uses latent-sampling weighted regression to address the gradient conflict.
- The paper is on arXiv with ID 2605.27877.
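The summary does not spell out how Latent Self-Imitation is computed, so the following is only one plausible reading of "latent-sampling weighted regression", not the authors' algorithm. It samples latents, decodes them into bounded residuals around a base action, scores the candidates with a critic, and forms a regression target using exponential weights in the style of advantage-weighted regression. The functions `decode_residual`, `q_value`, and `self_imitation_target`, the toy critic, and the `temperature` parameter are all hypothetical.

```python
import math
import random


def decode_residual(latent, scale=0.1):
    """Hypothetical decoder: maps a latent sample to a bounded residual."""
    return scale * math.tanh(latent)


def q_value(state, action):
    """Toy critic for illustration only; it peaks at action = 0.3."""
    return -(action - 0.3) ** 2


def self_imitation_target(state, base_action, num_samples=16,
                          temperature=0.01, rng=None):
    """Build a weighted-regression target from latent-sampled candidates."""
    if rng is None:
        rng = random.Random(0)  # fixed seed keeps the sketch reproducible
    latents = [rng.gauss(0.0, 1.0) for _ in range(num_samples)]
    candidates = [base_action + decode_residual(z) for z in latents]
    scores = [q_value(state, a) for a in candidates]
    # Exponential weights, shifted by the max score for numerical stability.
    top = max(scores)
    weights = [math.exp((s - top) / temperature) for s in scores]
    total = sum(weights)
    weights = [w / total for w in weights]
    # The regression target is the weight-averaged candidate action.
    target = sum(w * a for w, a in zip(weights, candidates))
    return target, candidates, weights
```

In this sketch the policy would regress toward `target` rather than follow the critic's gradient directly, so improvement and fitting share a single regression objective. Whether SPAR resolves the gradient conflict in this way is an assumption of the example.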
Entities
Institutions
- arXiv