Unified Framework for f-Divergence Regularized RLHF
A new theoretical framework has been developed for Reinforcement Learning from Human Feedback (RLHF) with general f-divergence regularization. While existing RLHF methods rely primarily on reverse KL regularization, recent empirical work has explored alternatives such as forward KL and chi-squared divergences. This study provides a unified analysis across the entire f-divergence function class and proposes two algorithms based on distinct sampling principles: one extends the optimism principle with an exploration bonus, while the other exploits the sensitivity of the regularized objective. The work addresses the gap in theoretical understanding of general f-divergence regularization in online RLHF.
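For concreteness, the regularized objective this refers to is standardly written as follows (the notation here is the usual textbook form, assumed rather than quoted from the paper):

$$\max_{\pi}\ \mathbb{E}_{x\sim\rho}\Big[\,\mathbb{E}_{y\sim\pi(\cdot\mid x)}[r(x,y)] \;-\; \beta\, D_f\big(\pi(\cdot\mid x)\,\|\,\pi_{\mathrm{ref}}(\cdot\mid x)\big)\Big], \qquad D_f(p\,\|\,q) \;=\; \mathbb{E}_{y\sim q}\!\Big[f\Big(\tfrac{p(y)}{q(y)}\Big)\Big],$$

where $f$ is convex with $f(1)=0$. The choices $f(t)=t\log t$, $f(t)=-\log t$, and $f(t)=(t-1)^2$ recover reverse KL, forward KL, and chi-squared regularization, respectively, under the common RLHF convention of regularizing the policy $\pi$ toward the reference $\pi_{\mathrm{ref}}$.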
Key facts
- The framework covers general f-divergence regularization in RLHF.
- Existing approaches rely primarily on reverse KL regularization.
- Recent empirical studies explore forward KL and chi-squared divergences (see the divergence sketch after this list).
- Two algorithms are proposed: one based on the optimism principle with an exploration bonus, the other on exploiting the sensitivity of the objective (see the selection sketch after this list).
- The work provides a unified theoretical analysis across the f-divergence function class.
- The study focuses on online RLHF.
- The algorithms use distinct sampling principles.
- The framework fills a gap in the theoretical understanding of general f-divergence regularization in online RLHF.
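To make the divergence choices concrete, here is a minimal sketch (all numbers and names are hypothetical, not from the paper) that evaluates the f-divergence regularized objective for small categorical policies; the generators f are the standard textbook choices for reverse KL, forward KL, and chi-squared.

```python
# Minimal illustrative sketch (not the paper's code): evaluate the f-divergence
# regularized objective E_{y~pi}[r(y)] - beta * D_f(pi || pi_ref) for small
# categorical policies, with the standard generator for each divergence.
import numpy as np

def f_divergence(p, q, f):
    """D_f(p || q) = sum_y q(y) * f(p(y) / q(y)) for categorical p, q."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return float(np.sum(q * f(p / q)))

# Standard convex generators with f(1) = 0, named using the common RLHF
# convention of regularizing the policy pi toward the reference pi_ref.
f_reverse_kl  = lambda t: t * np.log(t)    # D_f(pi || pi_ref) = KL(pi || pi_ref)
f_forward_kl  = lambda t: -np.log(t)       # D_f(pi || pi_ref) = KL(pi_ref || pi)
f_chi_squared = lambda t: (t - 1.0) ** 2   # D_f(pi || pi_ref) = chi^2(pi || pi_ref)

def regularized_objective(pi, pi_ref, reward, beta, f):
    """Expected reward under pi minus beta times the chosen f-divergence."""
    return float(np.dot(pi, reward)) - beta * f_divergence(pi, pi_ref, f)

# Hypothetical numbers: three candidate responses to a single prompt.
pi_ref = np.array([0.5, 0.3, 0.2])
pi     = np.array([0.7, 0.2, 0.1])
reward = np.array([1.0, 0.2, -0.5])
for name, f in [("reverse KL", f_reverse_kl),
                ("forward KL", f_forward_kl),
                ("chi-squared", f_chi_squared)]:
    print(name, round(regularized_objective(pi, pi_ref, reward, beta=0.1, f=f), 4))
```

Only the generator f changes across regularizers; the same objective template covers the whole f-divergence class, which is the sense in which the analysis is unified.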
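The first sampling principle, optimism with an exploration bonus, can be illustrated generically (this is a standard UCB-style selection rule used here only as an illustration, not the paper's specific algorithm): query feedback on the candidate whose estimated reward plus uncertainty bonus is largest.

```python
# Generic illustration of optimism with an exploration bonus (UCB-style),
# not the paper's algorithm: prefer candidates whose estimated reward plus
# uncertainty is highest when choosing what to query for feedback.
import numpy as np

def optimistic_choice(reward_estimate, uncertainty, bonus_weight=1.0):
    """Index of the candidate maximizing reward_estimate + bonus_weight * uncertainty."""
    scores = np.asarray(reward_estimate) + bonus_weight * np.asarray(uncertainty)
    return int(np.argmax(scores))

# Hypothetical values for three candidate responses.
reward_estimate = np.array([0.8, 0.5, 0.4])
uncertainty     = np.array([0.05, 0.10, 0.60])  # e.g., spread across an ensemble of reward models
print(optimistic_choice(reward_estimate, uncertainty))  # -> 2: the uncertain candidate gets explored
```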