Logging Policy Design to Minimize Off-Policy Evaluation Error
A recent paper on arXiv (arXiv:2605.15108) tackles the problem of designing logging policies for off-policy evaluation (OPE), in which the value of a target policy, for example a new recommendation policy, is estimated from data generated by a different logging policy. The authors identify a central reward-coverage tradeoff: concentrating probability on high-reward actions can reduce variance, but it risks leaving actions the target policy would take poorly covered in the logged data. They introduce a unifying framework for logging policy design and derive optimal logging policies under three informational regimes: (i) the target policy and reward distribution are known at logging time, (ii) they are unknown, and (iii) they are partially known through prior information or noisy estimates. The findings offer practical guidance for companies selecting logging policies to reduce OPE error.
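To make the tradeoff concrete, here is a minimal simulation sketch, not from the paper: the five-action bandit, its reward means, and both logging policies below are invented for illustration. It compares the mean squared error of the standard inverse propensity scoring (IPS) estimator under a logger that concentrates on the highest-reward action versus one that covers the target policy's support:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 5-action bandit; reward means and policies are invented for illustration.
true_means = np.array([0.9, 0.7, 0.5, 0.3, 0.1])   # Bernoulli reward mean per action
target = np.array([0.1, 0.6, 0.1, 0.1, 0.1])       # target policy pi to evaluate
true_value = target @ true_means                   # ground-truth value V(pi)

def ips_mse(logging, n=5_000, reps=200):
    """Monte-Carlo MSE of the standard IPS estimate of V(pi) under `logging`."""
    errs = []
    for _ in range(reps):
        actions = rng.choice(len(target), size=n, p=logging)
        rewards = rng.binomial(1, true_means[actions])
        weights = target[actions] / logging[actions]   # importance weights
        errs.append(np.mean(weights * rewards) - true_value)
    return float(np.mean(np.square(errs)))

# A logger that chases the highest-reward action (index 0) covers the target's
# favorite action (index 1) poorly, inflating the importance weights there.
greedy_logger  = np.array([0.80, 0.05, 0.05, 0.05, 0.05])
covered_logger = np.array([0.20, 0.40, 0.20, 0.10, 0.10])

print("greedy  logger MSE:", ips_mse(greedy_logger))
print("covered logger MSE:", ips_mse(covered_logger))
```

In this toy setup the greedy logger's MSE is an order of magnitude larger, even though it spends most of its probability on the single best action.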
Key facts
- Paper arXiv:2605.15108 on off-policy evaluation (OPE)
- Focuses on designing logging policies to minimize OPE error
- Identifies reward-coverage tradeoff in logging policy design
- Proposes a unifying framework for logging policy design
- Derives optimal policies for known, unknown, and partially known regimes
- Regime (i): target policy and reward distribution are known at logging time (see the sketch after this list)
- Regime (ii): target policy and reward distribution are unknown
- Regime (iii): target policy and reward distribution are partially known through priors or noisy estimates
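The paper's regime-specific derivations are not reproduced here, but a classical importance-sampling result illustrates what an optimal logger in regime (i) can look like: when the target policy pi and the reward second moments E[r(a)^2] are known, the per-sample variance of IPS is minimized by logging each action with probability proportional to pi(a) * sqrt(E[r(a)^2]). A short sketch, reusing the invented bandit from above:

```python
import numpy as np

# Same invented bandit as in the previous sketch (illustrative numbers only).
true_means = np.array([0.9, 0.7, 0.5, 0.3, 0.1])   # E[r|a]; Bernoulli, so E[r^2|a] = E[r|a]
target = np.array([0.1, 0.6, 0.1, 0.1, 0.1])       # target policy pi
true_value = target @ true_means

def ips_variance(logging):
    """Per-sample IPS variance: sum_a pi(a)^2 E[r^2|a] / mu(a) - V(pi)^2."""
    return float(np.sum(target**2 * true_means / logging) - true_value**2)

# Variance-minimizing logger: mu*(a) proportional to pi(a) * sqrt(E[r(a)^2]).
mu_star = target * np.sqrt(true_means)
mu_star /= mu_star.sum()

uniform = np.full(len(target), 1 / len(target))
print("uniform logger variance:", ips_variance(uniform))   # ~0.99
print("optimal logger variance:", ips_variance(mu_star))   # ~0.21
```

Note how the optimal logger shifts mass toward actions the target policy plays often, not toward high-reward actions per se: the reward-coverage tradeoff in miniature.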