Calibration Framework for Probabilistic Label Ranking

other · 2026-06-01

A new study formalizes calibration for probabilistic label ranking, a task in which models predict a distribution over orderings of a label set. The authors define a hierarchy of calibration notions covering full rankings, sub-rankings, and top-k rankings. They prove that full-rank calibration implies the other two but not vice versa, and that sub-ranking and top-k calibration are incomparable: neither implies the other. Empirical tests show that popular label ranking models are often poorly calibrated, with substantial differences between sub-ranking and top-k metrics. The framework is also applied to RLHF reward models, where it reveals calibration issues in preference learning.

Key facts

A model is calibrated when its predicted probabilities match the observed outcome frequencies.
Probabilistic label ranking predicts a distribution over orderings of a label set.
Full-rank calibration implies sub-ranking and top-k calibration.
Sub-ranking and top-k calibration are incomparable.
Popular label ranking models are often poorly calibrated.
Substantial differences exist between sub-ranking and top-k metrics.
The framework is applied to RLHF reward models.
The study is published on arXiv with ID 2605.30447.
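
The three ranking types in the key facts above are related by marginalization: a distribution over full rankings determines a distribution over top-k prefixes and over the induced order on any label subset. The sketch below illustrates only that relationship. It is not the paper's method or its calibration metric; the function names, the three labels, and the probabilities are hypothetical and chosen for illustration.

```python
from itertools import permutations


def top_k_marginal(ranking_probs, k):
    """Sum full-ranking probabilities that share the same top-k prefix."""
    out = {}
    for ranking, p in ranking_probs.items():
        prefix = ranking[:k]
        out[prefix] = out.get(prefix, 0.0) + p
    return out


def sub_ranking_marginal(ranking_probs, subset):
    """Sum full-ranking probabilities that induce the same order on `subset`."""
    out = {}
    for ranking, p in ranking_probs.items():
        induced = tuple(label for label in ranking if label in subset)
        out[induced] = out.get(induced, 0.0) + p
    return out


# Hypothetical predicted distribution over the 6 orderings of three labels.
labels = ("a", "b", "c")
probs = dict(zip(permutations(labels), [0.30, 0.20, 0.15, 0.15, 0.10, 0.10]))

top1 = top_k_marginal(probs, 1)                # distribution over the top label
pair_ab = sub_ranking_marginal(probs, {"a", "b"})  # distribution over the order of a and b

print(top1)
print(pair_ab)
```

Because both marginals are sums over the full-ranking distribution, it is plausible that matching observed frequencies at the full-ranking level carries over to them, which is consistent with the implication reported in the study. The two marginals group the rankings differently, so agreement on one need not imply agreement on the other.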

Entities

—

Sources

arXiv cs.AI — 2026-06-01