Unified Framework for f-Divergence Regularized RLHF
A new theoretical framework has been developed for Reinforcement Learning from Human Feedback (RLHF) with general f-divergence regularization. While existing RLHF methods rely primarily on reverse KL regularization, recent empirical work has explored alternatives such as forward KL and chi-squared divergences. This study provides a unified analysis across the entire f-divergence function class and proposes two algorithms based on distinct sampling principles: one extends the optimism principle with an exploration bonus, while the other exploits the sensitivity of the regularized objective. The work addresses the gap in theoretical understanding of general f-divergence regularization in online RLHF.
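For concreteness, the regularized objective this refers to is standardly written as follows (the notation here is the usual textbook form, assumed rather than quoted from the paper):

$$\max_{\pi}\ \mathbb{E}_{x\sim\rho}\Big[\,\mathbb{E}_{y\sim\pi(\cdot\mid x)}[r(x,y)] \;-\; \beta\, D_f\big(\pi(\cdot\mid x)\,\|\,\pi_{\mathrm{ref}}(\cdot\mid x)\big)\Big], \qquad D_f(p\,\|\,q) \;=\; \mathbb{E}_{y\sim q}\!\Big[f\Big(\tfrac{p(y)}{q(y)}\Big)\Big],$$

where $f$ is convex with $f(1)=0$. The choices $f(t)=t\log t$, $f(t)=-\log t$, and $f(t)=(t-1)^2$ recover reverse KL, forward KL, and chi-squared regularization, respectively, under the common RLHF convention of regularizing the policy $\pi$ toward the reference $\pi_{\mathrm{ref}}$.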
Key facts
- The framework covers general f-divergence regularization in RLHF.
- Existing approaches rely primarily on reverse KL regularization.
- Recent empirical studies explore forward KL and chi-squared divergences (see the divergence sketch after this list).
- Two algorithms are proposed: one based on the optimism principle with an exploration bonus, the other on exploiting the sensitivity of the objective (see the selection sketch after this list).
- The work provides a unified theoretical analysis across the f-divergence function class.
- The study focuses on online RLHF.
- The algorithms use distinct sampling principles.
- The framework fills a gap in the theoretical understanding of general f-divergence regularization in online RLHF.
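To make the divergence choices concrete, here is a minimal sketch (all numbers and names are hypothetical, not from the paper) that evaluates the f-divergence regularized objective for small categorical policies; the generators f are the standard textbook choices for reverse KL, forward KL, and chi-squared.

```python
# Minimal illustrative sketch (not the paper's code): evaluate the f-divergence
# regularized objective E_{y~pi}[r(y)] - beta * D_f(pi || pi_ref) for small
# categorical policies, with the standard generator for each divergence.
import numpy as np

def f_divergence(p, q, f):
    """D_f(p || q) = sum_y q(y) * f(p(y) / q(y)) for categorical p, q."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return float(np.sum(q * f(p / q)))

# Standard convex generators with f(1) = 0, named using the common RLHF
# convention of regularizing the policy pi toward the reference pi_ref.
f_reverse_kl  = lambda t: t * np.log(t)    # D_f(pi || pi_ref) = KL(pi || pi_ref)
f_forward_kl  = lambda t: -np.log(t)       # D_f(pi || pi_ref) = KL(pi_ref || pi)
f_chi_squared = lambda t: (t - 1.0) ** 2   # D_f(pi || pi_ref) = chi^2(pi || pi_ref)

def regularized_objective(pi, pi_ref, reward, beta, f):
    """Expected reward under pi minus beta times the chosen f-divergence."""
    return float(np.dot(pi, reward)) - beta * f_divergence(pi, pi_ref, f)

# Hypothetical numbers: three candidate responses to a single prompt.
pi_ref = np.array([0.5, 0.3, 0.2])
pi     = np.array([0.7, 0.2, 0.1])
reward = np.array([1.0, 0.2, -0.5])
for name, f in [("reverse KL", f_reverse_kl),
                ("forward KL", f_forward_kl),
                ("chi-squared", f_chi_squared)]:
    print(name, round(regularized_objective(pi, pi_ref, reward, beta=0.1, f=f), 4))
```

Only the generator f changes across regularizers; the same objective template covers the whole f-divergence class, which is the sense in which the analysis is unified.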
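The first sampling principle, optimism with an exploration bonus, can be illustrated generically (this is a standard UCB-style selection rule used here only as an illustration, not the paper's specific algorithm): query feedback on the candidate whose estimated reward plus uncertainty bonus is largest.

```python
# Generic illustration of optimism with an exploration bonus (UCB-style),
# not the paper's algorithm: prefer candidates whose estimated reward plus
# uncertainty is highest when choosing what to query for feedback.
import numpy as np

def optimistic_choice(reward_estimate, uncertainty, bonus_weight=1.0):
    """Index of the candidate maximizing reward_estimate + bonus_weight * uncertainty."""
    scores = np.asarray(reward_estimate) + bonus_weight * np.asarray(uncertainty)
    return int(np.argmax(scores))

# Hypothetical values for three candidate responses.
reward_estimate = np.array([0.8, 0.5, 0.4])
uncertainty     = np.array([0.05, 0.10, 0.60])  # e.g., spread across an ensemble of reward models
print(optimistic_choice(reward_estimate, uncertainty))  # -> 2: the uncertain candidate gets explored
```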