POETS: Efficient Policy Ensembles for LLM Uncertainty Quantification
POETS (Policy Ensembles for Thompson Sampling) is a framework for tackling the exploration-exploitation dilemma in sequential decision-making and black-box optimization with large language models (LLMs). It builds on the observation that a policy trained with Kullback-Leibler (KL) regularization implicitly encodes a reward function. By training an ensemble of such policies on online bootstrapped data, POETS captures epistemic uncertainty over these implicit rewards directly, removing the need for separate uncertainty-aware reward models. To keep ensembling LLMs tractable, the ensemble members share a single pre-trained backbone, which cuts memory and compute costs. The approach is described in arXiv preprint 2605.07775.
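For context, the claim that KL-regularized policies encode rewards rests on a standard identity, written below in generic KL-regularized RL notation (the symbols β, π_ref, and Z(x) are assumptions here, not necessarily the paper's): the optimal policy under a KL penalty toward a reference policy has a closed form, and inverting it shows that each policy encodes a reward up to a per-prompt constant.

```latex
% Standard KL-regularized RL identity (generic notation, not the paper's own):
% maximizing  E_{y ~ pi(.|x)}[ r(x,y) ] - beta * KL( pi(.|x) || pi_ref(.|x) )
% yields the closed-form optimum below, which inverts to an implicit reward.
\[
  \pi^{*}(y \mid x) \;=\; \frac{1}{Z(x)}\,\pi_{\mathrm{ref}}(y \mid x)\,
      \exp\!\bigl(r(x,y)/\beta\bigr)
  \quad\Longrightarrow\quad
  r_{\pi}(x,y) \;=\; \beta \log \frac{\pi(y \mid x)}{\pi_{\mathrm{ref}}(y \mid x)}
      \;+\; \beta \log Z(x).
\]
```

An ensemble of such policies therefore induces an ensemble of implicit rewards, and disagreement among the members serves as an epistemic-uncertainty signal without training a separate reward model.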
Key facts
- POETS stands for Policy Ensembles for Thompson Sampling
- It bridges uncertainty quantification and policy optimization
- Policies with KL regularization encode implicit reward functions
- Ensemble captures epistemic uncertainty via bootstrapped data
- Shared pre-trained backbone reduces LLM ensembling costs (see the Python sketch after this list)
- Addresses exploration-exploitation in sequential decision-making
- Published on arXiv with ID 2605.07775
- Method bypasses nested reward-model training by reusing each policy's implicit reward
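Below is a minimal, self-contained sketch (in PyTorch, on a toy contextual-bandit setup) of the ensemble-plus-Thompson-sampling loop the facts above describe. All class and function names, the loss form, and hyperparameters such as `beta` and `keep_prob` are illustrative assumptions, not the paper's actual implementation.

```python
# Sketch of ensemble Thompson sampling with a shared frozen backbone.
# Names, loss form, and hyperparameters are assumptions, not the paper's API.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SharedBackboneEnsemble(nn.Module):
    """K lightweight policy heads on one frozen, shared feature extractor."""
    def __init__(self, feat_dim: int, n_actions: int, n_heads: int = 4):
        super().__init__()
        # Stand-in for a frozen pre-trained LLM backbone.
        self.backbone = nn.Sequential(nn.Linear(32, feat_dim), nn.ReLU())
        for p in self.backbone.parameters():
            p.requires_grad_(False)
        # Each head is a cheap per-member policy; only the heads are trained.
        self.heads = nn.ModuleList(
            nn.Linear(feat_dim, n_actions) for _ in range(n_heads)
        )

    def logits(self, x: torch.Tensor, k: int) -> torch.Tensor:
        return self.heads[k](self.backbone(x))

def thompson_step(model: SharedBackboneEnsemble, x: torch.Tensor):
    """Sample one ensemble member (a posterior sample), then act from it."""
    k = torch.randint(len(model.heads), (1,)).item()
    probs = F.softmax(model.logits(x, k), dim=-1)
    action = torch.multinomial(probs, 1).item()
    return k, action

def update_heads(model, opt, batch, ref_logits_fn, beta=0.1, keep_prob=0.5):
    """KL-regularized policy update on per-head bootstrapped data."""
    x, actions, rewards = batch  # observed online interactions
    for k in range(len(model.heads)):
        # Bootstrap mask: each head trains on a random subsample of the data,
        # so the heads diverge where data is scarce (epistemic uncertainty).
        mask = torch.rand(len(rewards)) < keep_prob
        if mask.sum() == 0:
            continue
        logp = F.log_softmax(model.logits(x[mask], k), dim=-1)
        chosen = logp.gather(1, actions[mask].unsqueeze(1)).squeeze(1)
        # Reward-weighted likelihood plus a KL penalty toward a reference
        # policy; under KL regularization each head then implicitly encodes
        # its own reward, as in the identity above.
        ref_logp = F.log_softmax(ref_logits_fn(x[mask]), dim=-1)
        kl = (logp.exp() * (logp - ref_logp)).sum(-1).mean()
        loss = -(rewards[mask] * chosen).mean() + beta * kl
        opt.zero_grad()
        loss.backward()
        opt.step()
```

A typical loop would call `thompson_step` to collect interactions into a buffer, then periodically call `update_heads` with an optimizer over head parameters only, e.g. `torch.optim.Adam([p for h in model.heads for p in h.parameters()], lr=1e-3)`, with `ref_logits_fn` given by a frozen copy of the initial policy. The shared frozen backbone is what keeps the ensemble cheap: only the small heads carry per-member parameters.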
Entities
Institutions
- arXiv