New AI Research Proposes Explained Variance Policy Optimization for LLM Post-Training

ai-technology · 2026-04-22

A recent study presents Explained Variance Policy Optimization (EVPO), a method for deciding when to use a learned critic in reinforcement learning (RL) for post-training of large language models. The research addresses a central design decision in RL for LLMs: the choice between critic-based methods such as Proximal Policy Optimization (PPO) and critic-free methods such as GRPO. Classical theory favors critic-based methods because a learned baseline reduces variance. The authors show, however, that in sparse-reward settings a learned critic can introduce estimation noise that exceeds the state signal, which increases advantage variance rather than reducing it. By casting baseline selection as a Kalman filtering problem, the paper unifies PPO and GRPO as two extremes of the Kalman gain. The authors establish that explained variance (EV), computable from a single training batch, marks the exact boundary: positive EV means the critic reduces variance, while zero or negative EV means it inflates variance. This result underpins EVPO, which decides adaptively whether to use the critic based on the explained variance metric. The paper was posted to arXiv under identifier 2604.19485v1. It arrives amid a growing preference for critic-free methods, which are simpler to use and perform strongly despite the theoretical support for critic-based methods.
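The explained-variance test described above can be illustrated with a short sketch. This is not the paper's implementation: it assumes the standard definition EV = 1 - Var(returns - values) / Var(returns), and the function names `explained_variance` and `choose_baseline`, as well as the fallback to a batch-mean baseline (GRPO-style), are hypothetical choices made for illustration.

```python
import numpy as np

def explained_variance(returns, values):
    """Standard explained variance: 1 - Var(returns - values) / Var(returns).

    Assumption: this common definition is what the summary means by
    "explained variance"; the paper may define it differently.
    """
    returns = np.asarray(returns, dtype=float)
    values = np.asarray(values, dtype=float)
    var_returns = returns.var()
    if var_returns == 0.0:
        # No variance in the returns to explain; treat the critic as unhelpful.
        return 0.0
    return 1.0 - (returns - values).var() / var_returns

def choose_baseline(returns, values):
    """Hypothetical selection rule: critic baseline if EV > 0, else batch mean."""
    ev = explained_variance(returns, values)
    if ev > 0.0:
        # Critic residuals vary less than raw returns: use the critic.
        baseline = np.asarray(values, dtype=float)
    else:
        # Critic adds noise: fall back to a critic-free mean baseline.
        baseline = np.full(len(returns), np.mean(returns))
    return ev, baseline

# Sparse 0/1 rewards with a reasonably accurate critic and a badly wrong one.
returns = [1.0, 0.0, 1.0, 0.0]
ev_good, _ = choose_baseline(returns, [0.9, 0.1, 0.8, 0.2])
ev_bad, baseline_bad = choose_baseline(returns, [0.0, 1.0, 0.0, 1.0])
print(ev_good > 0, ev_bad > 0, baseline_bad)
```

With a constant baseline the advantage variance equals Var(returns), and with the critic it equals Var(returns - values), so under this definition the sign of EV says directly which of the two is smaller.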

Key facts

The research paper introduces Explained Variance Policy Optimization (EVPO).
EVPO addresses whether to use a learned critic in RL for LLM post-training.
In sparse-reward settings, a learned critic can increase advantage variance.
Baseline selection is cast as a Kalman filtering problem.
Explained variance identifies when a critic reduces or inflates variance.
The paper unifies PPO and GRPO as two extremes of the Kalman gain.
The research was announced on arXiv under identifier 2604.19485v1.
The arXiv announcement type is cross (a cross-listing).
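The Kalman-gain view in the key facts above can be made concrete with the textbook scalar Kalman update. The sketch below is only an assumption about how that view might map onto baselines: it treats the group-mean return as the prior and the critic's prediction as a noisy measurement, and the names `kalman_gain` and `blended_baseline` are hypothetical. The paper's actual formulation is not given in this summary.

```python
def kalman_gain(prior_var, critic_noise_var):
    """Scalar Kalman gain K = P / (P + R).

    prior_var (P): variance of the true value around the prior (group mean).
    critic_noise_var (R): variance of the critic's estimation noise.
    """
    total = prior_var + critic_noise_var
    if total == 0.0:
        # Degenerate case with no uncertainty anywhere: keep the prior.
        return 0.0
    return prior_var / total

def blended_baseline(group_mean, critic_value, gain):
    """Scalar Kalman update: prior + K * (measurement - prior)."""
    return group_mean + gain * (critic_value - group_mean)

# Gain 1: the critic is trusted fully (critic-based extreme, as in PPO).
print(blended_baseline(0.5, 0.9, kalman_gain(1.0, 0.0)))
# Gain 0: the critic is ignored (critic-free extreme, as in GRPO).
print(blended_baseline(0.5, 0.9, kalman_gain(0.0, 1.0)))
```

In this reading, a noisy critic (large `critic_noise_var`) pushes the gain toward 0 and the baseline toward the group mean, which matches the sparse-reward concern raised in the summary.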
