RL Framework Trains Prompting Policies for Black-Box LLMs via Iterative Distillation
A new reinforcement learning (RL) framework trains prompting policies for frozen black-box Large Language Models (LLMs) by iteratively distilling experience. The approach uses a lightweight prompter model optimized to maximize task-specific rewards for a larger frozen worker LLM, together with a contrastive experience buffer that couples scalar rewards with dense textual critiques, amortizing iterative prompt refinement into single-shot policy weights. On the Big Bench Extra Hard (BBEH) and Tau-bench benchmarks, performance improved from 55% to 90% on logic-intensive reasoning tasks and from 74% to 91% on tool-use tasks. The work treats prompt engineering as a critical optimization problem when the underlying LLM is fixed.
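How such a train-and-distill loop might look is sketched below. This is a minimal illustration, not the paper's implementation: all names here (`worker_llm`, `reward_fn`, `critic`, `Experience`) are assumed stand-ins, since the actual interfaces are not given in this summary.

```python
import random
from dataclasses import dataclass

# Hypothetical stand-ins for the paper's components; the names and
# interfaces below are assumptions for illustration, not the authors' code.

@dataclass
class Experience:
    prompt: str     # the prompter's proposed prompt
    reward: float   # scalar task-specific reward
    critique: str   # dense textual critique coupled with the reward

def worker_llm(prompt: str) -> str:
    """Frozen black-box worker: only queried, never fine-tuned."""
    return f"worker answer to: {prompt!r}"

def reward_fn(answer: str) -> float:
    """Task-specific scalar reward (stubbed here with a random score)."""
    return random.random()

def critic(prompt: str, reward: float) -> str:
    """Produces the textual critique that gets paired with each reward."""
    quality = "effective" if reward > 0.5 else "weak"
    return f"prompt judged {quality} (reward {reward:.2f})"

buffer: list[Experience] = []
for step in range(10):
    prompt = f"Let's reason step by step about task {step}"  # prompter proposal
    answer = worker_llm(prompt)
    reward = reward_fn(answer)
    buffer.append(Experience(prompt, reward, critic(prompt, reward)))
    # A real prompter policy would now be updated from contrastive pairs
    # drawn from this buffer, distilling refinement into single-shot weights.
```

The key property is that only the prompter's policy weights change; the worker LLM remains a black box reached through prompts alone.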
Key facts
- Proposes an RL framework for training learned prompting policies via iterative distillation of experience.
- Uses a lightweight prompter model optimized to maximize task-specific rewards for a larger frozen worker LLM.
- Contrastive experience buffer couples scalar rewards with dense textual critiques (see the sketch after this list).
- Amortizes iterative prompt refinement into single-shot policy weights.
- Experimental analysis on the Big Bench Extra Hard (BBEH) and Tau-bench suites.
- Performance improved from 55% to 90% in logic-intensive reasoning tasks.
- Performance improved from 74% to 91% in tool-use tasks.
- Addresses prompt engineering as a critical optimization challenge for black-box LLMs.
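As a rough illustration of the contrastive-pairing idea referenced in the buffer bullet above, here is a minimal sketch. The pairing rule (best-rewarded versus worst-rewarded experience) is an assumption for clarity, not necessarily the paper's exact mechanism.

```python
import random
from dataclasses import dataclass

@dataclass
class Experience:
    prompt: str
    reward: float   # scalar task reward
    critique: str   # dense textual critique coupled with that reward

class ContrastiveBuffer:
    """Stores critique-annotated prompts and serves high/low-reward pairs."""

    def __init__(self) -> None:
        self.items: list[Experience] = []

    def add(self, exp: Experience) -> None:
        self.items.append(exp)

    def sample_pair(self) -> tuple[Experience, Experience]:
        # Contrast the best- and worst-rewarded experiences so the prompter
        # can learn, from the paired critiques, what separates them.
        ranked = sorted(self.items, key=lambda e: e.reward)
        return ranked[-1], ranked[0]  # (high-reward, low-reward)

# Usage: populate the buffer, then draw a contrastive pair for a policy update.
buf = ContrastiveBuffer()
for i in range(4):
    r = random.random()
    buf.add(Experience(f"prompt variant {i}", r, f"critique for variant {i}"))
high, low = buf.sample_pair()
print(high.critique, "| vs |", low.critique)
```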
Entities
Institutions
- arXiv