ARTFEED — Contemporary Art Intelligence

DVPO: Distributional Value Modeling for Robust LLM Post-Training

ai-technology · 2026-05-07

A new reinforcement learning framework, DVPO (Distributional Value Modeling with Risk-aware Policy Optimization), has been introduced to improve LLM post-training under noisy or incomplete supervision. The approach combines conditional risk theory with distributional value modeling to balance robustness and generalization: DVPO learns token-level value distributions for fine-grained supervision and applies asymmetric risk regularization to shape those distributions. This addresses limitations of existing methods: worst-case optimization (RFQI, CQL) can be overly conservative, while mean-based approaches (PPO, GRPO) can behave unevenly across noisy scenarios. The paper is available on arXiv under ID 2512.03847.
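To make the two ingredients concrete, here is a minimal NumPy sketch of the general ideas the summary names: a token-level value distribution represented by quantiles (a standard distributional-RL choice), and an asymmetric, lower-tail risk adjustment in the spirit of conditional risk measures. The function names, quantile parameterization, and blending weight below are illustrative assumptions, not the paper's actual objective.

```python
import numpy as np

def pinball_loss(pred_quantiles, target, taus):
    """Quantile-regression (pinball) loss for one token's value target.

    This is the standard way to train a quantile-based value distribution;
    whether DVPO uses exactly this loss is an assumption here.
    """
    diff = target - pred_quantiles  # shape: (n_quantiles,)
    return float(np.mean(np.maximum(taus * diff, (taus - 1.0) * diff)))

def lower_tail_cvar(pred_quantiles, taus, alpha=0.25):
    """Conditional value at risk: mean of the quantiles with tau <= alpha."""
    return float(pred_quantiles[taus <= alpha].mean())

def risk_adjusted_value(pred_quantiles, taus, alpha=0.25, lam=0.5):
    """Blend the mean value with its lower tail.

    The penalty is asymmetric: optimistic (upper-tail) mass is discounted,
    pessimistic mass is not, which is one plausible reading of "asymmetric
    risk regularization". alpha and lam are hypothetical knobs.
    """
    mean_v = float(pred_quantiles.mean())
    return (1.0 - lam) * mean_v + lam * lower_tail_cvar(pred_quantiles, taus, alpha)

# Toy example: 10 quantile levels and a sorted set of token-value estimates.
taus = np.linspace(0.05, 0.95, 10)
q = np.sort(np.random.default_rng(0).normal(0.0, 1.0, 10))
loss = pinball_loss(q, target=0.2, taus=taus)
value = risk_adjusted_value(q, taus)
```

Because the risk-adjusted value never exceeds the plain mean, a policy trained against it is steered away from actions whose apparent value rests on a few optimistic samples, which is the robustness-to-noise behavior the summary attributes to DVPO.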

Key facts

  • DVPO stands for Distributional Value Modeling with Risk-aware Policy Optimization
  • The framework targets LLM post-training with noisy or incomplete supervision
  • It combines conditional risk theory with distributional value modeling
  • Token-level value distributions provide fine-grained supervision
  • Asymmetric risk regularization is applied to shape the distribution
  • Existing worst-case methods (RFQI, CQL) and mean-based methods (PPO, GRPO) are cited as overly conservative or uneven across scenarios, respectively
  • The paper is on arXiv with ID 2512.03847
  • The announcement type is replace-cross

Entities

Institutions

  • arXiv

Sources