POETS: Efficient Policy Ensembles for LLM Uncertainty Quantification
POETS (Policy Ensembles for Thompson Sampling) is a framework for tackling the exploration-exploitation dilemma in sequential decision-making and black-box optimization with large language models (LLMs). It builds on the observation that a policy trained with Kullback-Leibler (KL) regularization implicitly encodes a reward function. By training an ensemble of such policies on online bootstrapped data, POETS captures epistemic uncertainty over these implicit rewards directly, removing the need for separate uncertainty-aware reward models. To keep ensembling LLMs tractable, the ensemble members share a single pre-trained backbone, which cuts memory and compute costs. The approach is described in arXiv preprint 2605.07775.
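For context, the claim that KL-regularized policies encode rewards rests on a standard identity, written below in generic KL-regularized RL notation (the symbols β, π_ref, and Z(x) are assumptions here, not necessarily the paper's): the optimal policy under a KL penalty toward a reference policy has a closed form, and inverting it shows that each policy encodes a reward up to a per-prompt constant.

```latex
% Standard KL-regularized RL identity (generic notation, not the paper's own):
% maximizing  E_{y ~ pi(.|x)}[ r(x,y) ] - beta * KL( pi(.|x) || pi_ref(.|x) )
% yields the closed-form optimum below, which inverts to an implicit reward.
\[
  \pi^{*}(y \mid x) \;=\; \frac{1}{Z(x)}\,\pi_{\mathrm{ref}}(y \mid x)\,
      \exp\!\bigl(r(x,y)/\beta\bigr)
  \quad\Longrightarrow\quad
  r_{\pi}(x,y) \;=\; \beta \log \frac{\pi(y \mid x)}{\pi_{\mathrm{ref}}(y \mid x)}
      \;+\; \beta \log Z(x).
\]
```

An ensemble of such policies therefore induces an ensemble of implicit rewards, and disagreement among the members serves as an epistemic-uncertainty signal without training a separate reward model.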
Key facts
- POETS stands for Policy Ensembles for Thompson Sampling
- It bridges uncertainty quantification and policy optimization
- Policies with KL regularization encode implicit reward functions
- Ensemble captures epistemic uncertainty via bootstrapped data
- Shared pre-trained backbone reduces LLM ensembling costs (see the Python sketch after this list)
- Addresses exploration-exploitation in sequential decision-making
- Published on arXiv with ID 2605.07775
- Method bypasses nested reward-model training by reusing each policy's implicit reward
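Below is a minimal, self-contained sketch (in PyTorch, on a toy contextual-bandit setup) of the ensemble-plus-Thompson-sampling loop the facts above describe. All class and function names, the loss form, and hyperparameters such as `beta` and `keep_prob` are illustrative assumptions, not the paper's actual implementation.

```python
# Sketch of ensemble Thompson sampling with a shared frozen backbone.
# Names, loss form, and hyperparameters are assumptions, not the paper's API.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SharedBackboneEnsemble(nn.Module):
    """K lightweight policy heads on one frozen, shared feature extractor."""
    def __init__(self, feat_dim: int, n_actions: int, n_heads: int = 4):
        super().__init__()
        # Stand-in for a frozen pre-trained LLM backbone.
        self.backbone = nn.Sequential(nn.Linear(32, feat_dim), nn.ReLU())
        for p in self.backbone.parameters():
            p.requires_grad_(False)
        # Each head is a cheap per-member policy; only the heads are trained.
        self.heads = nn.ModuleList(
            nn.Linear(feat_dim, n_actions) for _ in range(n_heads)
        )

    def logits(self, x: torch.Tensor, k: int) -> torch.Tensor:
        return self.heads[k](self.backbone(x))

def thompson_step(model: SharedBackboneEnsemble, x: torch.Tensor):
    """Sample one ensemble member (a posterior sample), then act from it."""
    k = torch.randint(len(model.heads), (1,)).item()
    probs = F.softmax(model.logits(x, k), dim=-1)
    action = torch.multinomial(probs, 1).item()
    return k, action

def update_heads(model, opt, batch, ref_logits_fn, beta=0.1, keep_prob=0.5):
    """KL-regularized policy update on per-head bootstrapped data."""
    x, actions, rewards = batch  # observed online interactions
    for k in range(len(model.heads)):
        # Bootstrap mask: each head trains on a random subsample of the data,
        # so the heads diverge where data is scarce (epistemic uncertainty).
        mask = torch.rand(len(rewards)) < keep_prob
        if mask.sum() == 0:
            continue
        logp = F.log_softmax(model.logits(x[mask], k), dim=-1)
        chosen = logp.gather(1, actions[mask].unsqueeze(1)).squeeze(1)
        # Reward-weighted likelihood plus a KL penalty toward a reference
        # policy; under KL regularization each head then implicitly encodes
        # its own reward, as in the identity above.
        ref_logp = F.log_softmax(ref_logits_fn(x[mask]), dim=-1)
        kl = (logp.exp() * (logp - ref_logp)).sum(-1).mean()
        loss = -(rewards[mask] * chosen).mean() + beta * kl
        opt.zero_grad()
        loss.backward()
        opt.step()
```

A typical loop would call `thompson_step` to collect interactions into a buffer, then periodically call `update_heads` with an optimizer over head parameters only, e.g. `torch.optim.Adam([p for h in model.heads for p in h.parameters()], lr=1e-3)`, with `ref_logits_fn` given by a frozen copy of the initial policy. The shared frozen backbone is what keeps the ensemble cheap: only the small heads carry per-member parameters.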
Entities
Institutions
- arXiv