Vector Policy Optimization Enhances LLM Diversity for Test-Time Search

ai-technology · 2026-05-23

A new reinforcement learning algorithm, Vector Policy Optimization (VPO), trains language models to produce diverse outputs for inference-time search procedures such as AlphaEvolve. Standard post-training optimizes a single scalar reward, which yields low-entropy output distributions that limit the diversity available to search. VPO instead exploits vector-valued rewards, such as per-test-case correctness in code or multiple user personas, and serves as a drop-in replacement for the GRPO advantage estimator. It explicitly trains policies to anticipate varied downstream reward functions and to generate diverse solutions, addressing a key limitation of current LLM post-training paradigms.

Key facts

VPO trains LLMs to produce diverse solutions for inference-time search.
Standard post-training optimizes a single scalar reward, causing low-entropy outputs.
VPO uses vector-valued rewards like per-test-case correctness or multiple user personas.
VPO is a drop-in replacement for the GRPO advantage estimator.
The algorithm targets diversity needed by search procedures such as AlphaEvolve.
VPO explicitly trains policies to anticipate diverse downstream reward functions.
The paper is published on arXiv with ID 2605.22817.
The approach addresses a key limitation in current LLM post-training.
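
To make the scalar-versus-vector contrast in the facts above concrete, here is a minimal sketch of the commonly described GRPO-style group-relative advantage (each sample's scalar reward standardized within its group), applied to a hypothetical set of per-test-case rewards. This is not VPO's estimator, which this summary does not specify; the function name, the reward matrix, and the mean-pass-rate collapse are all assumptions chosen for illustration.

```python
from statistics import mean, pstdev

def grpo_advantages(scalar_rewards, eps=1e-6):
    """Standardize scalar rewards within one prompt's sample group
    (the usual description of the GRPO baseline; details vary by implementation)."""
    mu = mean(scalar_rewards)
    sigma = pstdev(scalar_rewards)
    return [(r - mu) / (sigma + eps) for r in scalar_rewards]

# Hypothetical vector rewards: 4 sampled programs x 3 test cases (1 = pass).
vector_rewards = [
    [1, 1, 0],
    [1, 1, 0],
    [0, 0, 1],  # the only sample that passes test case 3
    [0, 0, 0],
]

# Assumed scalar collapse: mean pass rate per sample.
scalar_rewards = [sum(v) / len(v) for v in vector_rewards]
advantages = grpo_advantages(scalar_rewards)

# Which test cases are solved by at least one sample in the group.
covered = [any(v[j] for v in vector_rewards) for j in range(3)]

for v, a in zip(vector_rewards, advantages):
    print(v, round(a, 3))
print("covered tests:", covered)
```

In this toy group, the third sample is the only one that passes test case 3, yet its collapsed reward falls below the group mean, so the scalar estimator assigns it a negative advantage. That is the kind of information a vector-valued reward retains and a single scalar discards.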
