ARTFEED — Contemporary Art Intelligence

RL Framework Trains Prompting Policies for Black-Box LLMs via Iterative Distillation

ai-technology · 2026-05-16

A new reinforcement learning (RL) framework trains prompting policies for frozen black-box large language models (LLMs) by iteratively distilling experience. The approach uses a lightweight prompter model optimized to maximize task-specific rewards for a larger frozen worker LLM. A contrastive experience buffer couples scalar rewards with dense textual critiques, allowing iterative prompt refinement to be amortized into single-shot policy weights. On the Big Bench Extra Hard (BBEH) and Tau-bench suites, performance improved from 55% to 90% on logic-intensive reasoning tasks and from 74% to 91% on tool-use tasks. The method frames prompt engineering as a central optimization problem when the underlying LLM is fixed.
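The setup described above can be sketched as a loop in which a small prompter policy proposes prompts, the frozen worker LLM executes them, and (prompt, reward, critique) triples accumulate as experience. The sketch below is a minimal toy illustration under assumed names (`frozen_worker`, `reward_fn`, `PrompterPolicy` and its prompt templates are all hypothetical stand-ins, not the paper's actual API); the prompter here is a trivial bandit over templates rather than a learned model.

```python
# Toy sketch of the prompter/worker loop. All components are illustrative
# stand-ins: the real framework uses a learned prompter model and an actual
# black-box LLM as the worker.

def frozen_worker(prompt: str) -> str:
    """Stand-in for the black-box worker LLM; its weights are never updated.
    Toy behavior: it answers correctly only when asked to reason step by step."""
    return "correct" if "step by step" in prompt else "wrong"

def reward_fn(answer: str) -> float:
    """Stand-in task-specific scalar reward."""
    return 1.0 if answer == "correct" else 0.0

class PrompterPolicy:
    """Trivial bandit-style prompter over a fixed set of prompt templates."""
    def __init__(self, templates):
        self.values = {t: 0.0 for t in templates}   # running mean reward
        self.counts = {t: 0 for t in templates}

    def propose(self) -> str:
        # Try each template once, then exploit the best-performing one.
        untried = [t for t in self.counts if self.counts[t] == 0]
        if untried:
            return untried[0]
        return max(self.values, key=self.values.get)

    def update(self, template: str, reward: float):
        self.counts[template] += 1
        n = self.counts[template]
        self.values[template] += (reward - self.values[template]) / n

templates = ["Answer: {task}", "Think step by step: {task}"]
policy = PrompterPolicy(templates)
experience = []  # (template, reward, critique) triples
for _ in range(20):
    t = policy.propose()
    answer = frozen_worker(t.format(task="2+2"))
    r = reward_fn(answer)
    critique = "shows reasoning" if r else "missing reasoning"
    experience.append((t, r, critique))
    policy.update(t, r)

best = max(policy.values, key=policy.values.get)
print(best)
```

After a handful of iterations the policy settles on the step-by-step template; the key point is that only the prompter adapts, while the worker stays frozen throughout.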

Key facts

  • Proposes an RL framework for training learned prompting policies via iterative distillation of experience.
  • Uses a lightweight prompter model optimized to maximize task-specific rewards for a larger frozen worker LLM.
  • Contrastive experience buffer couples scalar rewards with dense textual critiques.
  • Amortizes iterative prompt refinement into single-shot policy weights.
  • Experimental analysis on Big Bench Extra Hard (BBEH) and Tau-bench suites.
  • Performance improved from 55% to 90% in logic-intensive reasoning tasks.
  • Performance improved from 74% to 91% in tool-use tasks.
  • Addresses prompt engineering as a critical optimization challenge for black-box LLMs.
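The contrastive experience buffer named above can be pictured as a store whose entries pair each scalar reward with a textual critique, and whose sampling returns high-reward/low-reward pairs so their critiques can be contrasted. The sketch below is an assumption-laden illustration (the class, field names, and eviction policy are invented for clarity, not taken from the paper):

```python
# Illustrative contrastive experience buffer: each entry couples a scalar
# reward with a dense textual critique; sampling contrasts the best and
# worst entries. Structure and names are hypothetical.
from dataclasses import dataclass

@dataclass
class Experience:
    prompt: str
    reward: float   # scalar task-specific reward
    critique: str   # dense textual feedback on the prompt's outcome

class ContrastiveBuffer:
    def __init__(self, capacity: int = 1000):
        self.capacity = capacity
        self.entries: list[Experience] = []

    def add(self, exp: Experience) -> None:
        self.entries.append(exp)
        if len(self.entries) > self.capacity:
            self.entries.pop(0)  # simple FIFO eviction

    def sample_pair(self) -> tuple[Experience, Experience]:
        """Return (highest-reward, lowest-reward) experiences; contrasting
        their critiques yields a textual refinement signal."""
        best = max(self.entries, key=lambda e: e.reward)
        worst = min(self.entries, key=lambda e: e.reward)
        return best, worst

buf = ContrastiveBuffer()
buf.add(Experience("Answer directly.", 0.2, "skips intermediate reasoning"))
buf.add(Experience("Reason step by step.", 0.9, "shows the full derivation"))
best, worst = buf.sample_pair()
print(best.critique, "vs", worst.critique)
```

Distilling what the contrasted critiques say about good versus bad prompts into the prompter's weights is what would let iterative refinement be amortized into a single-shot policy.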

Entities

Institutions

  • arXiv

Sources