Delight-Gated Exploration: A New Heuristic for Bandit Problems
Delight-gated exploration (DE) is a newly proposed algorithm for reinforcement learning and bandit problems in which the action space is too large to explore exhaustively within a given budget. DE uses a host-override mechanism that allocates exploratory actions only when the anticipated delight, computed as expected improvement multiplied by surprisal, exceeds a fixed gate price. This rule recovers Pandora's reservation-value principle for costly search, with surprisal setting the effective inspection cost. Resolved arms exit the gate, fresh arms are shut off above a prior-determined threshold, and selected linear-bandit overrides draw on a limited information budget. Across Bernoulli bandits, linear bandits, and tabular MDPs, DE shows markedly slower regret growth than Thompson Sampling and ε-greedy, with the same hyperparameters transferring without retuning. The paper is available on arXiv under identifier 2605.13287.
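The gating rule described above can be sketched in a few lines. The paper's exact definitions are not given here, so the snippet below is a minimal Bernoulli-bandit illustration under stated assumptions: Beta posteriors per arm, a posterior-standard-deviation proxy for expected improvement over the incumbent best arm, and a negative-log-mean proxy for surprisal. The function name `delight_gate_choice` and the `gate_price` parameter are illustrative, not from the paper.

```python
import math

def delight_gate_choice(successes, failures, gate_price=0.05):
    """Pick an arm: exploit the best posterior mean unless some arm's
    'delight' (expected improvement x surprisal) exceeds the gate price.
    Illustrative sketch only; not the paper's exact rule."""
    means = [(s + 1) / (s + f + 2)  # posterior mean under a Beta(1,1) prior
             for s, f in zip(successes, failures)]
    best = max(means)
    best_arm = means.index(best)
    top_delight, top_arm = 0.0, None
    for i in range(len(means)):
        if i == best_arm:
            continue
        n = successes[i] + failures[i]
        # Expected-improvement proxy: one posterior std of optimistic
        # headroom over the incumbent's mean.
        std = math.sqrt(means[i] * (1 - means[i]) / (n + 2))
        improvement = max(0.0, means[i] + std - best)
        # Surprisal proxy: how unexpected a success from this arm is
        # under the current posterior mean.
        surprisal = -math.log(max(means[i], 1e-12))
        delight = improvement * surprisal
        # The gate: explore only when delight clears the gate price.
        if delight > gate_price and delight > top_delight:
            top_delight, top_arm = delight, i
    return top_arm if top_arm is not None else best_arm
```

With a well-resolved incumbent and a nearly resolved inferior arm, the gate stays shut and the incumbent is exploited; an unexplored arm with enough posterior headroom clears the gate and is inspected.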
Key facts
- DE is a host-override rule for exploration.
- Delight is defined as expected improvement times surprisal.
- DE recovers Pandora's reservation-value rule.
- Surprisal sets the effective inspection cost.
- Resolved arms exit the gate.
- Fresh arms are shut off above a prior-determined threshold.
- Hyperparameters transfer across Bernoulli bandits, linear bandits, and tabular MDPs without retuning.
- DE shows slower regret growth than Thompson Sampling and ε-greedy.