ARTFEED — Contemporary Art Intelligence

POETS: Efficient Policy Ensembles for LLM Uncertainty Quantification

ai-technology · 2026-05-11

Researchers have introduced POETS (Policy Ensembles for Thompson Sampling), a framework that tackles the exploration-exploitation dilemma in sequential decision-making and black-box optimization. The method builds on the observation that a policy trained with Kullback-Leibler (KL) regularization implicitly encodes a reward function. By training an ensemble of such policies and aligning their implicit rewards with online bootstrapped data, POETS captures epistemic uncertainty without training a separate uncertainty-aware reward model. To keep ensembling large language models (LLMs) tractable, the architecture shares a pre-trained backbone across ensemble members, cutting memory and compute costs. The approach is detailed in arXiv preprint 2605.07775.
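The core identity behind the "policies encode rewards" claim can be illustrated with a toy numerical sketch (not the authors' code; the probabilities, the temperature `beta`, and the four-action setup are all hypothetical). The KL-regularized optimum pi(a|x) ∝ pi_ref(a|x)·exp(r(a|x)/beta) means the reward is recoverable, up to an additive constant, from the log policy ratio:

```python
import numpy as np

beta = 0.5
pi_ref = np.array([0.25, 0.25, 0.25, 0.25])    # reference policy over 4 actions
true_reward = np.array([1.0, 0.2, -0.3, 0.5])  # hypothetical per-action rewards

# Closed-form KL-regularized optimal policy: softmax of log pi_ref + r/beta.
logits = np.log(pi_ref) + true_reward / beta
pi = np.exp(logits - logits.max())
pi /= pi.sum()

# Recover the implicit reward from the policy ratio (up to a constant).
implicit = beta * np.log(pi / pi_ref)

# The gap to the true reward is the same constant for every action.
shift = implicit - true_reward
print(shift)  # four (numerically) identical entries
```

Because the constant cancels when comparing actions, an ensemble of such policies yields an ensemble of comparable implicit reward functions with no extra reward-model training, which is the property POETS exploits.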

Key facts

  • POETS stands for Policy Ensembles for Thompson Sampling
  • It bridges uncertainty quantification and policy optimization
  • Policies with KL regularization encode implicit reward functions
  • Ensemble captures epistemic uncertainty via bootstrapped data
  • Shared pre-trained backbone reduces LLM ensembling costs
  • Addresses exploration-exploitation in sequential decision-making
  • Published on arXiv with ID 2605.07775
  • Method bypasses nested reward model training
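The bootstrapping and Thompson-sampling facts above can be sketched as a minimal toy bandit (again, an illustration under assumed settings, not the paper's implementation; the action count, reward means, and noise scale are invented). Each ensemble member is fit on its own bootstrap resample of the logged data, member disagreement plays the role of epistemic uncertainty, and a Thompson step samples one member and acts greedily on it:

```python
import numpy as np

rng = np.random.default_rng(0)

n_actions, K = 4, 5                         # K = ensemble size
true_means = np.array([0.1, 0.9, 0.3, 0.5])  # hypothetical per-action rewards

# Logged (action, reward) pairs to bootstrap from.
actions = rng.integers(0, n_actions, size=500)
rewards = true_means[actions] + rng.normal(scale=0.1, size=500)

ensemble = []
for _ in range(K):
    idx = rng.integers(0, len(actions), size=len(actions))  # bootstrap resample
    est = np.zeros(n_actions)
    for a in range(n_actions):
        mask = actions[idx] == a
        est[a] = rewards[idx][mask].mean() if mask.any() else 0.0
    ensemble.append(est)

# Disagreement across members approximates epistemic uncertainty.
spread = np.std(np.stack(ensemble), axis=0)

# Thompson step: sample one member uniformly, exploit its belief.
member = ensemble[rng.integers(K)]
action = int(np.argmax(member))
```

In POETS the ensemble members are KL-regularized policy heads on a shared frozen LLM backbone rather than per-action mean estimates, so the per-member cost is a small head instead of a full model; the sampling logic is the same.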

Entities

Institutions

  • arXiv

Sources