Best-of-N Sampling in Reward Learning: Targets and Tradeoffs
A new analysis examines how Best-of-N sampling constructs pairwise preference data for Bradley-Terry reward learning. The study derives closed-form reward targets for independent-reference variants and shows that these targets preserve the latent reward ranking. For coupled variants such as Best-vs-Random and Best-vs-Worst, exact Bradley-Terry representability generally fails, but bounded-class minimizers approach the reference targets as N grows. The work clarifies the roles of N and the base distribution in preference data construction.
Key facts
- Best-of-N sampling is widely used to construct pairwise preference data
- N candidates are drawn from a base distribution, and the best is paired with a rejected response
- The analysis specializes a recent study of preference data via the induced conditional distribution
- Closed-form reward targets derived for independent-reference variants
- Targets preserve latent reward ranking
- Best-vs-Random and Best-vs-Worst variants couple chosen and rejected responses
- Exact Bradley-Terry (BT) representability generally fails for coupled variants
- Bounded-class minimizers approach reference targets as N grows
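
The pair-construction variants named above can be illustrated with a minimal Python sketch. It is not taken from the study. The function name `best_of_n_pair`, the variant labels, and the toy Gaussian base distribution are hypothetical. The sketch also assumes that the independent-reference variant draws the rejected response as a fresh sample independent of the N candidates, and that Best-vs-Random picks the rejected response uniformly from the remaining N-1 candidates. The study's exact definitions may differ.

```python
import random


def best_of_n_pair(sample, reward, n, variant, rng):
    """Build one (chosen, rejected) pair from n candidates.

    sample(rng) draws one response from the base distribution,
    reward(y) scores a response. Both are illustrative stand-ins.
    """
    if n < 2:
        raise ValueError("n must be at least 2")
    candidates = [sample(rng) for _ in range(n)]
    best_idx = max(range(n), key=lambda i: reward(candidates[i]))
    chosen = candidates[best_idx]

    if variant == "independent":
        # Assumption: rejected is a fresh draw, independent of the candidates.
        rejected = sample(rng)
    elif variant == "best_vs_random":
        # Assumption: rejected is uniform over the other n - 1 candidates,
        # so it is coupled to the chosen response through the same batch.
        others = [c for i, c in enumerate(candidates) if i != best_idx]
        rejected = rng.choice(others)
    elif variant == "best_vs_worst":
        # Rejected is the lowest-reward candidate in the same batch.
        rejected = min(candidates, key=reward)
    else:
        raise ValueError(f"unknown variant: {variant}")
    return chosen, rejected


def draw_standard_normal(rng):
    # Toy base distribution: a scalar "response" drawn from N(0, 1).
    return rng.gauss(0.0, 1.0)


def identity_reward(y):
    # Toy latent reward: the response value itself.
    return y


if __name__ == "__main__":
    rng = random.Random(0)
    for variant in ("independent", "best_vs_random", "best_vs_worst"):
        pair = best_of_n_pair(draw_standard_normal, identity_reward, 8, variant, rng)
        print(variant, pair)
```

In the two coupled variants the rejected response comes from the same batch as the chosen one, so its reward can never exceed the chosen reward. In the independent variant, as assumed here, no such ordering is guaranteed.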
Entities
Institutions
- arXiv