Logging Policy Design to Minimize Off-Policy Evaluation Error
A recent paper on arXiv (arXiv:2605.15108) tackles the problem of designing logging policies for off-policy evaluation (OPE), in which the value of a target policy, for example a new recommendation policy, is estimated from data generated by a different logging policy. The authors identify a central reward-coverage tradeoff: concentrating probability on high-reward actions can reduce variance, but it risks leaving actions the target policy would take poorly covered in the logged data. They introduce a unifying framework for logging policy design and derive optimal logging policies under three informational regimes: (i) the target policy and reward distribution are known at logging time, (ii) they are unknown, and (iii) they are partially known through prior information or noisy estimates. The findings offer practical guidance for companies selecting logging policies to reduce OPE error.
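To make the tradeoff concrete, here is a minimal simulation sketch, not from the paper: the five-action bandit, its reward means, and both logging policies below are invented for illustration. It compares the mean squared error of the standard inverse propensity scoring (IPS) estimator under a logger that concentrates on the highest-reward action versus one that covers the target policy's support:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 5-action bandit; reward means and policies are invented for illustration.
true_means = np.array([0.9, 0.7, 0.5, 0.3, 0.1])   # Bernoulli reward mean per action
target = np.array([0.1, 0.6, 0.1, 0.1, 0.1])       # target policy pi to evaluate
true_value = target @ true_means                   # ground-truth value V(pi)

def ips_mse(logging, n=5_000, reps=200):
    """Monte-Carlo MSE of the standard IPS estimate of V(pi) under `logging`."""
    errs = []
    for _ in range(reps):
        actions = rng.choice(len(target), size=n, p=logging)
        rewards = rng.binomial(1, true_means[actions])
        weights = target[actions] / logging[actions]   # importance weights
        errs.append(np.mean(weights * rewards) - true_value)
    return float(np.mean(np.square(errs)))

# A logger that chases the highest-reward action (index 0) covers the target's
# favorite action (index 1) poorly, inflating the importance weights there.
greedy_logger  = np.array([0.80, 0.05, 0.05, 0.05, 0.05])
covered_logger = np.array([0.20, 0.40, 0.20, 0.10, 0.10])

print("greedy  logger MSE:", ips_mse(greedy_logger))
print("covered logger MSE:", ips_mse(covered_logger))
```

In this toy setup the greedy logger's MSE is an order of magnitude larger, even though it spends most of its probability on the single best action.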
Key facts
- Paper arXiv:2605.15108 on off-policy evaluation (OPE)
- Focuses on designing logging policies to minimize OPE error
- Identifies reward-coverage tradeoff in logging policy design
- Proposes a unifying framework for logging policy design
- Derives optimal policies for known, unknown, and partially known regimes
- Regime (i): target policy and reward distribution are known at logging time (see the sketch after this list)
- Regime (ii): target policy and reward distribution are unknown
- Regime (iii): target policy and reward distribution are partially known through priors or noisy estimates
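The paper's regime-specific derivations are not reproduced here, but a classical importance-sampling result illustrates what an optimal logger in regime (i) can look like: when the target policy pi and the reward second moments E[r(a)^2] are known, the per-sample variance of IPS is minimized by logging each action with probability proportional to pi(a) * sqrt(E[r(a)^2]). A short sketch, reusing the invented bandit from above:

```python
import numpy as np

# Same invented bandit as in the previous sketch (illustrative numbers only).
true_means = np.array([0.9, 0.7, 0.5, 0.3, 0.1])   # E[r|a]; Bernoulli, so E[r^2|a] = E[r|a]
target = np.array([0.1, 0.6, 0.1, 0.1, 0.1])       # target policy pi
true_value = target @ true_means

def ips_variance(logging):
    """Per-sample IPS variance: sum_a pi(a)^2 E[r^2|a] / mu(a) - V(pi)^2."""
    return float(np.sum(target**2 * true_means / logging) - true_value**2)

# Variance-minimizing logger: mu*(a) proportional to pi(a) * sqrt(E[r(a)^2]).
mu_star = target * np.sqrt(true_means)
mu_star /= mu_star.sum()

uniform = np.full(len(target), 1 / len(target))
print("uniform logger variance:", ips_variance(uniform))   # ~0.99
print("optimal logger variance:", ips_variance(mu_star))   # ~0.21
```

Note how the optimal logger shifts mass toward actions the target policy plays often, not toward high-reward actions per se: the reward-coverage tradeoff in miniature.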