ARTFEED — Contemporary Art Intelligence

Vector Policy Optimization Enhances LLM Diversity for Test-Time Search

ai-technology · 2026-05-23

Vector Policy Optimization (VPO) is a new reinforcement learning algorithm that trains language models to produce diverse outputs for inference-time search procedures such as AlphaEvolve. Standard post-training optimizes a single scalar reward, which yields low-entropy output distributions and leaves a search procedure with few distinct candidates to explore. VPO instead exploits vector-valued rewards, such as per-test-case correctness in code or rewards from multiple user personas, and serves as a drop-in replacement for the GRPO advantage estimator. The approach explicitly trains policies to anticipate varied downstream reward functions and to generate diverse solutions, addressing the loss of diversity that limits current LLM post-training.

Key facts

  • VPO trains LLMs to produce diverse solutions for inference-time search.
  • Standard post-training optimizes a single scalar reward, causing low-entropy outputs.
  • VPO uses vector-valued rewards like per-test-case correctness or multiple user personas.
  • VPO is a drop-in replacement for the GRPO advantage estimator.
  • The algorithm targets the diversity needed by search procedures such as AlphaEvolve.
  • VPO explicitly trains policies to anticipate diverse downstream reward functions.
  • The paper is published on arXiv with ID 2605.22817.
  • The approach addresses a key limitation in current LLM post-training: scalar-reward optimization reduces output diversity.
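
To make the scalar-versus-vector distinction concrete, here is a minimal Python sketch. It is not the paper's estimator, which this summary does not specify. The first function is the standard GRPO group-relative advantage (each reward standardized within its sampling group). The second, per_dimension_advantages, is a hypothetical illustration of what a vector-valued signal can preserve; the test-case data and all names are assumptions made for the example.

```python
from statistics import mean, pstdev


def grpo_advantages(rewards, eps=1e-6):
    """Standard GRPO-style advantage: standardize rewards within one sampling group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def per_dimension_advantages(vector_rewards, eps=1e-6):
    """Hypothetical illustration (NOT the VPO estimator): standardize each
    reward dimension (here, each test case) separately across the group."""
    columns = list(zip(*vector_rewards))  # one column per test case
    col_adv = [grpo_advantages(list(col), eps) for col in columns]
    return [list(row) for row in zip(*col_adv)]  # back to one row per sample


# Assumed data: 4 sampled programs scored on 3 test cases (1 = pass, 0 = fail).
vector_rewards = [
    [1, 1, 0],  # sample 0 passes tests 0 and 1
    [0, 1, 1],  # sample 1 passes tests 1 and 2
    [1, 1, 0],  # sample 2 duplicates sample 0
    [0, 0, 0],  # sample 3 passes nothing
]

# Scalar baseline: collapse each vector to its pass rate before computing advantages.
scalar_rewards = [mean(v) for v in vector_rewards]
scalar_adv = grpo_advantages(scalar_rewards)

# Vector view: keep one advantage per test case.
per_dim = per_dimension_advantages(vector_rewards)

if __name__ == "__main__":
    print("scalar advantages:       ", [round(a, 3) for a in scalar_adv])
    for i, row in enumerate(per_dim):
        print(f"sample {i} per-test advantages:", [round(a, 3) for a in row])
```

In this toy group, samples 0 and 1 solve different test cases yet receive identical scalar advantages, so a scalar objective cannot tell them apart. The per-test view keeps them distinct: only sample 1 gets a positive advantage on the third test. This is the kind of information a vector-valued reward makes available to training.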

Entities

Institutions

  • arXiv

Sources