ARTFEED — Contemporary Art Intelligence

Best-of-N Sampling in Reward Learning: Targets and Tradeoffs

other · 2026-06-01

A new analysis examines how Best-of-N sampling is used to construct pairwise preference data for Bradley-Terry (BT) reward learning. For independent-reference variants, the study derives closed-form reward targets and shows that they preserve the latent reward ranking. For coupled variants such as Best-vs-Random and Best-vs-Worst, exact BT representability generally fails, but minimizers over a bounded class approach the reference targets as N grows. The work clarifies how N and the base distribution shape the constructed preference data.
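
To illustrate why a closed-form target can preserve the reward ranking, here is a short worked derivation. It is a sketch under stated assumptions, not the study's own result: the notation (π for the base distribution, s for the latent reward, F for its CDF, r* for the target) is hypothetical, the reward is assumed continuous with no ties, and "independent reference" is read as a rejected response drawn from π independently of the chosen one.

```latex
% Assumed notation (not from the source): \pi is the base distribution,
% s(y) the latent reward, F the CDF of s(Y) for Y \sim \pi (continuous, no ties).

% Density of the best of N i.i.d. draws from \pi:
p_N(y) = N \, \pi(y) \, F\big(s(y)\big)^{N-1}

% Population BT target when chosen \sim p_N and rejected \sim \pi independently:
\sigma\big(r^*(y) - r^*(y')\big)
  = \frac{p_N(y)\,\pi(y')}{p_N(y)\,\pi(y') + p_N(y')\,\pi(y)}
\;\Longrightarrow\;
r^*(y) = \log\frac{p_N(y)}{\pi(y)} + c = (N-1)\,\log F\big(s(y)\big) + c'
```

Under these assumptions and for N ≥ 2, r* is an increasing function of s(y) wherever F is strictly increasing, so the ordering of responses by latent reward is unchanged, and N acts as a scale on the target.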

Key facts

  • Best-of-N sampling is widely used to construct pairwise preference data
  • N candidates are drawn from a base distribution, and the best is paired with a rejected response
  • The analysis specializes a recent study of preference data through the induced conditional distribution
  • Closed-form reward targets are derived for independent-reference variants
  • These targets preserve the latent reward ranking
  • Best-vs-Random and Best-vs-Worst variants couple chosen and rejected responses
  • Exact BT representability generally fails for coupled variants
  • Bounded-class minimizers approach reference targets as N grows
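
The pair constructions listed above can be sketched in a few lines of Python. This is a minimal illustration, not the study's procedure: the function name `best_of_n_pair`, the variant labels, and the `sample` and `score` callables are all hypothetical, and the exact rules (Best-vs-Random draws the rejected response uniformly from the remaining N-1 candidates; the independent variant draws a fresh response from the base distribution) are assumptions.

```python
import random


def best_of_n_pair(sample, score, n, variant, rng):
    """Draw n candidates and return a (chosen, rejected) pair.

    sample(rng) draws one response from the base distribution and
    score(y) returns its reward. Both are hypothetical stand-ins.
    """
    if n < 2:
        raise ValueError("n must be at least 2")
    # Rank the n candidates from lowest to highest reward.
    ranked = sorted((sample(rng) for _ in range(n)), key=score)
    chosen = ranked[-1]
    if variant == "best_vs_worst":
        # Coupled: rejected is the lowest-scoring candidate in the same pool.
        rejected = ranked[0]
    elif variant == "best_vs_random":
        # Coupled (assumed rule): uniform over the other n - 1 candidates.
        rejected = rng.choice(ranked[:-1])
    elif variant == "independent":
        # Assumed rule: a fresh draw that does not depend on the pool.
        rejected = sample(rng)
    else:
        raise ValueError(f"unknown variant: {variant}")
    return chosen, rejected


if __name__ == "__main__":
    rng = random.Random(0)
    draw = lambda r: r.gauss(0.0, 1.0)  # toy base distribution
    reward = lambda y: y                # toy reward: the response itself
    for name in ("best_vs_worst", "best_vs_random", "independent"):
        print(name, best_of_n_pair(draw, reward, 4, name, rng))
```

In the two coupled variants the rejected response comes from the same pool as the chosen one, so it can never score higher than the chosen response. In the independent variant it can, which is the structural difference the key facts point to.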

Entities

Institutions

  • arXiv
