DVPO: Distributional Value Modeling for Robust LLM Post-Training
A new reinforcement learning framework, DVPO (Distributional Value Modeling with Risk-aware Policy Optimization), has been introduced to improve LLM post-training under noisy or incomplete supervision. The approach combines conditional risk theory with distributional value modeling to balance robustness and generalization: DVPO learns token-level value distributions for fine-grained supervision and applies asymmetric risk regularization to shape those distributions. This addresses limitations of existing methods, where worst-case optimization (RFQI, CQL) can be overly conservative and mean-based approaches (PPO, GRPO) can perform unevenly across scenarios. The paper is available on arXiv under ID 2512.03847.
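To make the two ingredients concrete, below is a minimal sketch of a token-level distributional value head trained with quantile regression, an asymmetric penalty that weights lower-tail spread more than upper-tail spread, and a lower-tail CVaR readout in the spirit of conditional risk theory. All names, shapes, quantile levels, and loss weights here (DistributionalValueHead, asymmetric_risk_penalty, cvar_from_quantiles, lam_low, lam_high, alpha) are illustrative assumptions, not the paper's API or objective.

```python
# Minimal sketch, assuming a quantile-based parameterization; all names,
# shapes, and loss terms here are illustrative, not the paper's method.
import torch
import torch.nn as nn


class DistributionalValueHead(nn.Module):
    """Hypothetical head predicting K value quantiles per token."""

    def __init__(self, hidden_size: int, num_quantiles: int = 16):
        super().__init__()
        self.proj = nn.Linear(hidden_size, num_quantiles)
        # Midpoint quantile levels tau_k in (0, 1).
        taus = (torch.arange(num_quantiles) + 0.5) / num_quantiles
        self.register_buffer("taus", taus)

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # (batch, seq_len, hidden) -> (batch, seq_len, K) token-level quantiles
        return self.proj(hidden_states)


def quantile_loss(pred, returns, taus):
    """Pinball loss fitting each predicted quantile to the observed return."""
    diff = returns.unsqueeze(-1) - pred                 # (B, T, K)
    # Under-predictions are weighted by tau, over-predictions by (1 - tau).
    return torch.maximum(taus * diff, (taus - 1.0) * diff).mean()


def asymmetric_risk_penalty(pred, lam_low=1.0, lam_high=0.1):
    """Assumed regularizer: penalize lower-tail spread more than upper-tail."""
    median = pred.median(dim=-1, keepdim=True).values
    below = (median - pred).clamp(min=0.0)              # spread below the median
    above = (pred - median).clamp(min=0.0)              # spread above the median
    return (lam_low * below + lam_high * above).mean()


def cvar_from_quantiles(pred, taus, alpha=0.25):
    """Lower-tail CVaR estimate: mean of quantiles at levels tau <= alpha."""
    return pred[..., taus <= alpha].mean(dim=-1)        # (B, T) risk-aware values


# Toy usage with stand-in tensors in place of real LLM hidden states.
head = DistributionalValueHead(hidden_size=768)
h = torch.randn(2, 10, 768)
q = head(h)                                             # (2, 10, 16)
returns = torch.randn(2, 10)
loss = quantile_loss(q, returns, head.taus) + 0.5 * asymmetric_risk_penalty(q)
loss.backward()
```

The sketch shows why token-level distributions give finer supervision than a scalar critic: each position carries a full quantile profile that a risk functional can then shape asymmetrically, rather than a single mean estimate.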
Key facts
- DVPO stands for Distributional Value Modeling with Risk-aware Policy Optimization
- The framework targets LLM post-training with noisy or incomplete supervision
- It combines conditional risk theory with distributional value modeling
- Token-level value distributions provide fine-grained supervision
- Asymmetric risk regularization is applied to shape the distribution
- Existing methods are cited as less effective: worst-case optimization (RFQI, CQL) can be overly conservative, while mean-based approaches (PPO, GRPO) can perform unevenly (see the policy-update sketch after this list)
- The paper is on arXiv with ID 2512.03847
- The announcement type is replace-cross
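To indicate where a risk-aware value estimate would enter the policy-optimization half of the method, here is a minimal PPO-style clipped update in which the usual mean-value baseline would be swapped for a lower-tail CVaR baseline such as cvar_from_quantiles from the earlier sketch. This pairing, and the function names, are assumptions for illustration, not DVPO's published objective.

```python
# Hedged sketch: a standard PPO clipped surrogate with a risk-aware baseline;
# DVPO's actual policy objective may differ from this illustration.
import torch


def clipped_policy_loss(logp_new, logp_old, advantages, clip_eps=0.2):
    """PPO clipped surrogate over token-level log-probabilities."""
    ratio = torch.exp(logp_new - logp_old)                    # (B, T)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * advantages
    return -torch.minimum(unclipped, clipped).mean()


# Toy usage; in DVPO-like training the advantages would come from returns
# minus a risk-aware baseline (e.g. a CVaR estimate), not random stand-ins.
logp_new = torch.randn(2, 10, requires_grad=True)
logp_old = logp_new.detach() + 0.01 * torch.randn(2, 10)
advantages = torch.randn(2, 10)                               # stand-in values
loss = clipped_policy_loss(logp_new, logp_old, advantages)
loss.backward()
```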