LLMs as Policy Optimizers for Sequential RL Tasks

ai-technology · 2026-06-01

A recent study examines when large language models (LLMs) can replace conventional reinforcement learning (RL) algorithms for policy optimization. The researchers propose Prompted Policy Optimization (PromptPO), in which the LLM is prompted with Python descriptions of the state space, action space, and reward function, then iteratively generates and refines executable policies using feedback from simulated rollouts. Across hard exploration environments, Meta-World robotics tasks, and real-world control problems, PromptPO often matched or exceeded standard RL baselines while requiring fewer environment interactions. The resulting policies ranged from proportional controllers to value iteration algorithms, without the LLM being explicitly prompted for any specific policy type.
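The generate-and-refine loop described above can be sketched roughly as follows. This is a minimal illustration, not the study's implementation: the one-dimensional task, the `TASK_DESCRIPTION` string, and every function name are hypothetical, and `stub_llm` is a stand-in that returns canned policy source instead of calling a real model.

```python
from typing import Callable, Tuple

# Hypothetical Python-style task description that would go into the prompt.
TASK_DESCRIPTION = '''
state: float   # position on a line, starts at 0.0
action: float  # step size, clipped to [-1.0, 1.0]
def reward(state: float) -> float:
    return -abs(state - 5.0)  # closer to the target at 5.0 is better
'''


def rollout(policy: Callable[[float], float], steps: int = 20) -> float:
    """Run one episode of the toy task and return the summed reward."""
    state, total = 0.0, 0.0
    for _ in range(steps):
        action = max(-1.0, min(1.0, policy(state)))
        state += action
        total += -abs(state - 5.0)
    return total


def stub_llm(prompt: str, attempt: int) -> str:
    """Stand-in for an LLM call (assumption): returns canned policy source."""
    candidates = [
        "def policy(state):\n    return 0.0\n",
        "def policy(state):\n    return 1.0\n",
        "def policy(state):\n    return 0.5 * (5.0 - state)\n",
    ]
    return candidates[min(attempt, len(candidates) - 1)]


def compile_policy(source: str) -> Callable[[float], float]:
    """Turn generated source into a callable. A real system should sandbox this."""
    namespace: dict = {}
    exec(source, namespace)
    return namespace["policy"]


def optimize(iterations: int = 3) -> Tuple[str, float]:
    """Propose a policy, roll it out, and feed the result back into the prompt."""
    feedback = "no rollouts yet"
    best_source, best_return = "", float("-inf")
    for attempt in range(iterations):
        prompt = TASK_DESCRIPTION + "\nFeedback from last rollout:\n" + feedback
        source = stub_llm(prompt, attempt)
        episode_return = rollout(compile_policy(source))
        feedback = f"return={episode_return:.2f} for:\n{source}"
        if episode_return > best_return:
            best_source, best_return = source, episode_return
    return best_source, best_return


if __name__ == "__main__":
    source, episode_return = optimize()
    print(source)
    print(f"best return: {episode_return:.2f}")
```

In this toy run the best candidate happens to be a proportional controller, but only because the stub was written that way. With a real model, the prompt would carry the task description plus rollout feedback, and the form of the policy would be left to the model.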

Key facts

Prompted Policy Optimization (PromptPO) uses LLMs to generate and refine RL policies
LLM is prompted with Python descriptions of state space, action space, and reward function
PromptPO tested on hard exploration environments, Meta-World robotics, and real-world control problems
Often matches or exceeds standard RL baselines with fewer environment interactions
Policies range from proportional controllers to value iteration algorithms
Study explores when LLMs can replace classical RL algorithms
Method is iterative and uses rollout feedback
The LLM is not explicitly prompted to produce any specific policy type
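To make the more algorithmic end of that policy range concrete, here is a small sketch of value iteration on a hypothetical five-state chain. The environment, reward, and discount factor are assumptions chosen for illustration; this is not a policy reported in the study.

```python
from typing import List, Tuple

ACTIONS = (-1, 1)  # move left or right along the chain


def step(state: int, action: int, n_states: int, goal: int) -> Tuple[int, float]:
    """Deterministic toy dynamics: reward 1.0 for entering the goal state."""
    nxt = max(0, min(n_states - 1, state + action))
    return nxt, (1.0 if nxt == goal else 0.0)


def value_iteration(
    n_states: int = 5, goal: int = 4, gamma: float = 0.9, tol: float = 1e-8
) -> Tuple[List[float], List[int]]:
    """Return state values and a greedy action for each non-goal state."""
    values = [0.0] * n_states  # the goal is terminal, so its value stays 0.0
    while True:
        delta = 0.0
        for s in range(n_states):
            if s == goal:
                continue
            best = max(
                r + gamma * values[nxt]
                for nxt, r in (step(s, a, n_states, goal) for a in ACTIONS)
            )
            delta = max(delta, abs(best - values[s]))
            values[s] = best
        if delta < tol:
            break

    greedy = []
    for s in range(n_states):
        if s == goal:
            continue

        def q_value(a: int, s: int = s) -> float:
            nxt, r = step(s, a, n_states, goal)
            return r + gamma * values[nxt]

        greedy.append(max(ACTIONS, key=q_value))
    return values, greedy


if __name__ == "__main__":
    values, greedy = value_iteration()
    print(values)
    print(greedy)
```

A policy of this kind plans over a model of the task rather than mapping states directly to actions, which is what sets it apart from a simple proportional controller.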

Entities

—

Sources

arXiv cs.AI — 2026-06-01