Vector Policy Optimization Enhances LLM Diversity for Test-Time Search
A new reinforcement learning algorithm, Vector Policy Optimization (VPO), trains language models to produce diverse outputs for inference-time search procedures like AlphaEvolve. Standard post-training optimizes a single scalar reward, which leads to low-entropy output distributions and reduces the diversity that search depends on. VPO instead exploits vector-valued rewards, such as per-test-case correctness in code or ratings from multiple user personas, and serves as a drop-in replacement for the GRPO advantage estimator. The approach explicitly trains policies to anticipate varied downstream reward functions and generate diverse solutions, addressing a key limitation in current LLM post-training paradigms.
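To make the motivation concrete, here is a minimal sketch of the scalar baseline that VPO is said to replace: a group-relative advantage in the style of GRPO, applied after per-test-case results are averaged into one number. It does not show VPO's own estimator, which this summary does not specify. The function names, the example reward vectors, and the use of the population standard deviation are all assumptions made for illustration.

```python
from statistics import mean, pstdev


def grpo_advantages(rewards, eps=1e-8):
    """Standardize scalar rewards within one prompt's group of samples.

    Assumption: population standard deviation plus a small eps;
    real implementations may differ in these details.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def covered_tests(vectors):
    """Indices of test cases passed by at least one sample in `vectors`."""
    return {j for v in vectors for j, passed in enumerate(v) if passed}


# Hypothetical per-test-case pass/fail vectors for four sampled programs.
vector_rewards = [
    [1, 1, 0, 0],  # sample 0 passes tests 0 and 1
    [0, 0, 1, 1],  # sample 1 passes tests 2 and 3
    [1, 1, 0, 0],  # sample 2 duplicates sample 0
    [0, 0, 0, 0],  # sample 3 passes nothing
]

# Scalar baseline: collapse each vector into a single pass rate.
scalar_rewards = [mean(v) for v in vector_rewards]
advantages = grpo_advantages(scalar_rewards)

# Samples 0, 1, and 2 receive the same advantage, although only the
# pair (0, 1) jointly covers every test case.
print(advantages)
print(covered_tests(vector_rewards[:2]))
print(covered_tests([vector_rewards[0], vector_rewards[2]]))
```

In this toy group the scalar signal cannot tell a complementary sample from a redundant one, which is the kind of information a vector-valued reward keeps available.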
Key facts
- VPO trains LLMs to produce diverse solutions for inference-time search.
- Standard post-training optimizes a single scalar reward, causing low-entropy outputs.
- VPO uses vector-valued rewards, such as per-test-case correctness or multiple user personas.
- VPO is a drop-in replacement for the GRPO advantage estimator.
- The algorithm targets diversity needed by search procedures such as AlphaEvolve.
- VPO explicitly trains policies to anticipate diverse downstream reward functions.
- The paper is available on arXiv under ID 2605.22817.
- The approach addresses a key limitation in current LLM post-training: optimizing a single scalar reward narrows output diversity.
Entities
Institutions
- arXiv