RL Framework Trains Prompting Policies for Black-Box LLMs via Iterative Distillation
A new reinforcement learning (RL) framework trains prompting policies for frozen black-box Large Language Models (LLMs) by iteratively distilling experience. The approach uses a lightweight prompter model optimized to maximize task-specific rewards for a larger frozen worker LLM, together with a contrastive experience buffer that couples scalar rewards with dense textual critiques, amortizing iterative prompt refinement into single-shot policy weights. On the Big Bench Extra Hard (BBEH) and Tau-bench benchmarks, performance improved from 55% to 90% on logic-intensive reasoning tasks and from 74% to 91% on tool-use tasks. The work treats prompt engineering as a critical optimization problem when the underlying LLM is fixed.
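How such a train-and-distill loop might look is sketched below. This is a minimal illustration, not the paper's implementation: all names here (`worker_llm`, `reward_fn`, `critic`, `Experience`) are assumed stand-ins, since the actual interfaces are not given in this summary.

```python
import random
from dataclasses import dataclass

# Hypothetical stand-ins for the paper's components; the names and
# interfaces below are assumptions for illustration, not the authors' code.

@dataclass
class Experience:
    prompt: str     # the prompter's proposed prompt
    reward: float   # scalar task-specific reward
    critique: str   # dense textual critique coupled with the reward

def worker_llm(prompt: str) -> str:
    """Frozen black-box worker: only queried, never fine-tuned."""
    return f"worker answer to: {prompt!r}"

def reward_fn(answer: str) -> float:
    """Task-specific scalar reward (stubbed here with a random score)."""
    return random.random()

def critic(prompt: str, reward: float) -> str:
    """Produces the textual critique that gets paired with each reward."""
    quality = "effective" if reward > 0.5 else "weak"
    return f"prompt judged {quality} (reward {reward:.2f})"

buffer: list[Experience] = []
for step in range(10):
    prompt = f"Let's reason step by step about task {step}"  # prompter proposal
    answer = worker_llm(prompt)
    reward = reward_fn(answer)
    buffer.append(Experience(prompt, reward, critic(prompt, reward)))
    # A real prompter policy would now be updated from contrastive pairs
    # drawn from this buffer, distilling refinement into single-shot weights.
```

The key property is that only the prompter's policy weights change; the worker LLM remains a black box reached through prompts alone.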
Key facts
- Proposes an RL framework for training learned prompting policies via iterative distillation of experience.
- Uses a lightweight prompter model optimized to maximize task-specific rewards for a larger frozen worker LLM.
- Contrastive experience buffer couples scalar rewards with dense textual critiques (see the sketch after this list).
- Amortizes iterative prompt refinement into single-shot policy weights.
- Experimental analysis on the Big Bench Extra Hard (BBEH) and Tau-bench suites.
- Performance improved from 55% to 90% in logic-intensive reasoning tasks.
- Performance improved from 74% to 91% in tool-use tasks.
- Addresses prompt engineering as a critical optimization challenge for black-box LLMs.
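As a rough illustration of the contrastive-pairing idea referenced in the buffer bullet above, here is a minimal sketch. The pairing rule (best-rewarded versus worst-rewarded experience) is an assumption for clarity, not necessarily the paper's exact mechanism.

```python
import random
from dataclasses import dataclass

@dataclass
class Experience:
    prompt: str
    reward: float   # scalar task reward
    critique: str   # dense textual critique coupled with that reward

class ContrastiveBuffer:
    """Stores critique-annotated prompts and serves high/low-reward pairs."""

    def __init__(self) -> None:
        self.items: list[Experience] = []

    def add(self, exp: Experience) -> None:
        self.items.append(exp)

    def sample_pair(self) -> tuple[Experience, Experience]:
        # Contrast the best- and worst-rewarded experiences so the prompter
        # can learn, from the paired critiques, what separates them.
        ranked = sorted(self.items, key=lambda e: e.reward)
        return ranked[-1], ranked[0]  # (high-reward, low-reward)

# Usage: populate the buffer, then draw a contrastive pair for a policy update.
buf = ContrastiveBuffer()
for i in range(4):
    r = random.random()
    buf.add(Experience(f"prompt variant {i}", r, f"critique for variant {i}"))
high, low = buf.sample_pair()
print(high.critique, "| vs |", low.critique)
```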
Entities
Institutions
- arXiv