Delight-Gated Exploration: A New Heuristic for Bandit Problems
Delight-gated exploration (DE) is a newly proposed algorithm for reinforcement learning and bandit problems in which the action space is too large to explore exhaustively within a given budget. DE uses a host-override mechanism that allocates exploratory actions only when the anticipated delight, computed as expected improvement multiplied by surprisal, exceeds a fixed gate price. This rule recovers Pandora's reservation-value principle for costly search, with surprisal setting the effective inspection cost. Resolved arms exit the gate, fresh arms are shut off above a prior-determined threshold, and selected linear-bandit overrides draw on a limited information budget. Across Bernoulli bandits, linear bandits, and tabular MDPs, DE shows markedly slower regret growth than Thompson Sampling and ε-greedy, with the same hyperparameters transferring without retuning. The paper is available on arXiv under identifier 2605.13287.
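The gating rule described above can be sketched in a few lines. The paper's exact definitions are not given here, so the snippet below is a minimal Bernoulli-bandit illustration under stated assumptions: Beta posteriors per arm, a posterior-standard-deviation proxy for expected improvement over the incumbent best arm, and a negative-log-mean proxy for surprisal. The function name `delight_gate_choice` and the `gate_price` parameter are illustrative, not from the paper.

```python
import math

def delight_gate_choice(successes, failures, gate_price=0.05):
    """Pick an arm: exploit the best posterior mean unless some arm's
    'delight' (expected improvement x surprisal) exceeds the gate price.
    Illustrative sketch only; not the paper's exact rule."""
    means = [(s + 1) / (s + f + 2)  # posterior mean under a Beta(1,1) prior
             for s, f in zip(successes, failures)]
    best = max(means)
    best_arm = means.index(best)
    top_delight, top_arm = 0.0, None
    for i in range(len(means)):
        if i == best_arm:
            continue
        n = successes[i] + failures[i]
        # Expected-improvement proxy: one posterior std of optimistic
        # headroom over the incumbent's mean.
        std = math.sqrt(means[i] * (1 - means[i]) / (n + 2))
        improvement = max(0.0, means[i] + std - best)
        # Surprisal proxy: how unexpected a success from this arm is
        # under the current posterior mean.
        surprisal = -math.log(max(means[i], 1e-12))
        delight = improvement * surprisal
        # The gate: explore only when delight clears the gate price.
        if delight > gate_price and delight > top_delight:
            top_delight, top_arm = delight, i
    return top_arm if top_arm is not None else best_arm
```

With a well-resolved incumbent and a nearly resolved inferior arm, the gate stays shut and the incumbent is exploited; an unexplored arm with enough posterior headroom clears the gate and is inspected.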
Key facts
- DE is a host-override rule for exploration.
- Delight is defined as expected improvement times surprisal.
- DE recovers Pandora's reservation-value rule.
- Surprisal sets the effective inspection cost.
- Resolved arms exit the gate.
- Fresh arms are shut off above a prior-determined threshold.
- Hyperparameters transfer across Bernoulli bandits, linear bandits, and tabular MDPs without retuning.
- DE shows slower regret growth than Thompson Sampling and ε-greedy.