ARTFEED — Contemporary Art Intelligence

Delight-Gated Exploration: A New Heuristic for Bandit Problems

other · 2026-05-14

A new algorithm, Delight-gated exploration (DE), has been introduced for reinforcement learning and bandit problems where the action space is too large to explore fully within a given budget. DE adds a host-override mechanism that spends exploratory actions only when the anticipated delight, defined as expected improvement times surprisal, exceeds a preset gate price. This recovers Pandora's reservation-value rule for costly search, with surprisal setting the effective inspection cost. Resolved arms exit the gate, fresh arms are shut off above a prior-determined threshold, and selected linear-bandit overrides draw on a limited information budget. DE shows markedly slower regret growth than Thompson Sampling and ε-greedy across Bernoulli bandits, linear bandits, and tabular MDPs, with the same hyperparameters transferring without retuning. The paper is available on arXiv under identifier 2605.13287.
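To make the gating idea concrete, here is a minimal toy sketch of one delight-gated step for a Bernoulli bandit with Beta posteriors. The paper's exact definitions are not given in this summary, so the expected-improvement and surprisal estimates below are illustrative stand-ins, and `gate_price` is a hypothetical parameter name:

```python
import math
import random

def delight_gated_choice(alpha, beta, gate_price, n_samples=500, rng=random):
    """Toy sketch of one delight-gated bandit step (illustrative only).

    alpha, beta: per-arm Beta posterior parameters (Bernoulli bandit).
    Delight is approximated as expected improvement times surprisal;
    both estimators here are stand-ins, not the paper's definitions.
    """
    k = len(alpha)
    means = [a / (a + b) for a, b in zip(alpha, beta)]
    best = max(range(k), key=lambda i: means[i])

    def expected_improvement(i):
        # Monte-Carlo estimate of E[max(theta_i - mu_best, 0)]
        total = 0.0
        for _ in range(n_samples):
            total += max(rng.betavariate(alpha[i], beta[i]) - means[best], 0.0)
        return total / n_samples

    def surprisal(i):
        # Stand-in: negative log of the arm's share of pulls, so
        # rarely pulled arms look more "surprising" (hence costlier
        # to skip) than well-resolved ones.
        n_i = alpha[i] + beta[i]
        n_total = sum(alpha) + sum(beta)
        return -math.log(n_i / n_total)

    # Gate: take the exploratory arm with the highest delight only
    # if its delight clears the gate price; otherwise exploit.
    delights = {i: expected_improvement(i) * surprisal(i)
                for i in range(k) if i != best}
    if delights:
        cand = max(delights, key=delights.get)
        if delights[cand] > gate_price:
            return cand  # exploratory pull: gate opens
    return best  # gate closed: exploit the current best arm
```

A well-resolved arm has both low expected improvement and low surprisal, so its delight falls below the gate price and it "exits the gate" in the sense above; a sufficiently high gate price shuts exploration off entirely.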

Key facts

  • DE is a host-override rule for exploration.
  • Delight is defined as expected improvement times surprisal.
  • DE recovers Pandora's reservation-value rule.
  • Surprisal sets the effective inspection cost.
  • Resolved arms exit the gate.
  • Fresh arms are shut off above a prior-determined threshold.
  • Hyperparameters transfer across Bernoulli bandits, linear bandits, and tabular MDPs without retuning.
  • DE shows slower regret growth than Thompson Sampling and ε-greedy.

Entities

Institutions

  • arXiv

Sources