ARTFEED — Contemporary Art Intelligence

Listwise Policy Optimization for LLM Post-Training

publication · 2026-05-09

A recent study posted to arXiv (2605.06139) presents Listwise Policy Optimization (LPO), a technique for post-training large language models with reinforcement learning from verifiable rewards (RLVR). The researchers identify a shared geometric structure underlying current group-based policy gradient methods: each implicitly defines a target distribution on the response simplex and moves toward it with a first-order approximation. LPO instead makes this target projection explicit, restricting the proximal RL objective to the response simplex and minimizing the divergence to the target exactly. The resulting updates guarantee monotonic improvement on the listwise objective and are bounded, zero-sum, and self-correcting.
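
A minimal sketch of that geometric picture, in NumPy. The group normalization, the temperature beta, and the exponentiated-weights closed form p*_i ∝ p_old_i · exp(beta · A_i) for the KL-proximal objective are standard RLVR ingredients used here for illustration; the function names are not from the paper, and LPO's exact projection may differ.

```python
import numpy as np

def group_advantages(rewards):
    # Group normalization in the style of group-based methods: mean-centered
    # (so advantages sum to zero) and std-scaled. Illustrative assumption.
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + 1e-8)

def target_distribution(p_old, advantages, beta=1.0):
    # Closed-form maximizer of  sum_i p_i * A_i - (1/beta) * KL(p || p_old)
    # over the response simplex: p*_i ∝ p_old_i * exp(beta * A_i).
    # This is the implicit target that a first-order step only approximates.
    w = p_old * np.exp(beta * advantages)
    return w / w.sum()

def first_order_step(p_old, advantages, lr=0.1):
    # What a plain policy-gradient step does: nudge probabilities along the
    # advantage direction, then renormalize back onto the simplex.
    p = np.clip(p_old * (1.0 + lr * advantages), 1e-12, None)
    return p / p.sum()

p_old = np.array([0.4, 0.3, 0.2, 0.1])    # sampled group of 4 responses
rewards = np.array([1.0, 0.0, 1.0, 0.0])  # verifiable 0/1 rewards
adv = group_advantages(rewards)
print("target     :", target_distribution(p_old, adv, beta=0.5))
print("first-order:", first_order_step(p_old, adv, lr=0.5))
```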

Key facts

  • arXiv paper 2605.06139
  • Introduces Listwise Policy Optimization (LPO)
  • Focuses on RLVR for LLM post-training
  • Reveals a shared geometric structure in group-based policy gradient methods
  • Performs explicit target projection on the response simplex
  • Guarantees monotonic improvement on the listwise objective
  • Updates are bounded, zero-sum, and self-correcting (see the sketch after this list)
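
A toy illustration of what those three properties can mean for a simplex-valued update. Taking the update direction to be q − p toward an explicit target q is an assumption made for this sketch, not the paper's construction: such a direction sums to zero across the group, each component stays in [−1, 1], and its sign flips once a response overshoots its target mass.

```python
import numpy as np

def update_direction_toward_target(p, q):
    # Direction from the current policy mass p toward an explicit target q,
    # both points on the response simplex. Properties of q - p:
    #   zero-sum:        components sum to 0, so total mass is conserved
    #   bounded:         each component lies in [-1, 1] since p, q are probs
    #   self-correcting: the sign flips once p_i overshoots its target q_i
    return q - p

p = np.array([0.70, 0.20, 0.10])  # current policy, overweighted on response 0
q = np.array([0.50, 0.35, 0.15])  # hypothetical target on the simplex
d = update_direction_toward_target(p, q)
print(d, d.sum())  # [-0.2, 0.15, 0.05], sums to 0: pushes back on response 0
```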

Entities

Institutions

  • arXiv
