Best-of-N Sampling in Reward Learning: Targets and Tradeoffs

other · 2026-06-01

A new analysis examines how Best-of-N sampling constructs pairwise preference data for Bradley-Terry (BT) reward learning. The study derives closed-form reward targets for independent-reference variants and shows that these targets preserve the latent reward ranking. For coupled variants such as Best-vs-Random and Best-vs-Worst, exact BT representability generally fails, but bounded-class minimizers approach the reference targets as N grows. The work clarifies the roles of N and the base distribution in preference data construction.

Key facts

Best-of-N sampling is widely used to construct pairwise preference data
N candidates are drawn from a base distribution, and the best is paired with a rejected response
The analysis specializes a recent study of preference data via the induced conditional distribution
Closed-form reward targets derived for independent-reference variants
Targets preserve latent reward ranking
Best-vs-Random and Best-vs-Worst variants couple chosen and rejected responses
Exact BT representability generally fails for coupled variants
Bounded-class minimizers approach reference targets as N grows
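
The pair-construction variants listed above can be illustrated with a minimal Python sketch. This is an illustration under stated assumptions, not the study's implementation: the names `best_of_n_pair`, `bt_loss`, `sample`, and `reward` are hypothetical, the "independent" branch assumes the rejected response is a fresh draw from the base distribution, and the "best_vs_random" branch assumes the rejected response is drawn uniformly from the remaining N-1 candidates. The summary does not specify either detail.

```python
import math
import random


def best_of_n_pair(sample, reward, n, variant, rng):
    """Build one (chosen, rejected) pair from N base-distribution draws.

    sample(rng) -> one response from the base distribution (hypothetical).
    reward(y)   -> latent reward used to rank candidates (hypothetical).
    """
    if n < 2 and variant != "independent":
        raise ValueError("coupled variants need at least two candidates")
    candidates = [sample(rng) for _ in range(n)]
    best_idx = max(range(n), key=lambda i: reward(candidates[i]))
    chosen = candidates[best_idx]

    if variant == "independent":
        # Assumption: rejected is a fresh draw, independent of the N candidates.
        rejected = sample(rng)
    elif variant == "best_vs_random":
        # Assumption: rejected is uniform over the other N-1 candidates,
        # so it is coupled to the chosen response through the shared pool.
        others = candidates[:best_idx] + candidates[best_idx + 1:]
        rejected = rng.choice(others)
    elif variant == "best_vs_worst":
        # Rejected is the lowest-reward candidate in the same pool.
        rejected = min(candidates, key=reward)
    else:
        raise ValueError(f"unknown variant: {variant}")
    return chosen, rejected


def bt_loss(r_chosen, r_rejected):
    """Bradley-Terry negative log-likelihood, -log sigmoid(r_chosen - r_rejected)."""
    d = r_chosen - r_rejected
    # Numerically stable form of log(1 + exp(-d)).
    return max(-d, 0.0) + math.log1p(math.exp(-abs(d)))


if __name__ == "__main__":
    rng = random.Random(0)
    # Toy base distribution: uniform scalars, with the value itself as reward.
    for variant in ("independent", "best_vs_random", "best_vs_worst"):
        chosen, rejected = best_of_n_pair(lambda r: r.random(), lambda y: y, 8, variant, rng)
        print(variant, round(chosen, 3), round(rejected, 3))
```

In the two coupled branches the rejected response comes from the same pool as the chosen one, so it can never outrank it. In the independent branch a fresh draw can occasionally beat the best of N, which is one way to see why the two families are analyzed separately.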
