DVPO: Distributional Value Modeling for Robust LLM Post-Training
A new reinforcement learning framework, DVPO (Distributional Value Modeling with Risk-aware Policy Optimization), has been introduced to improve LLM post-training under noisy or incomplete supervision. The approach combines conditional risk theory with distributional value modeling to balance robustness and generalization: DVPO learns token-level value distributions for fine-grained supervision and applies asymmetric risk regularization to shape those distributions. This addresses limitations of existing methods, where worst-case optimization (RFQI, CQL) can be overly conservative and mean-based approaches (PPO, GRPO) can perform unevenly across scenarios. The paper is available on arXiv under ID 2512.03847.
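To make the two ingredients concrete, below is a minimal sketch of a token-level distributional value head trained with quantile regression, an asymmetric penalty that weights lower-tail spread more than upper-tail spread, and a lower-tail CVaR readout in the spirit of conditional risk theory. All names, shapes, quantile levels, and loss weights here (DistributionalValueHead, asymmetric_risk_penalty, cvar_from_quantiles, lam_low, lam_high, alpha) are illustrative assumptions, not the paper's API or objective.

```python
# Minimal sketch, assuming a quantile-based parameterization; all names,
# shapes, and loss terms here are illustrative, not the paper's method.
import torch
import torch.nn as nn


class DistributionalValueHead(nn.Module):
    """Hypothetical head predicting K value quantiles per token."""

    def __init__(self, hidden_size: int, num_quantiles: int = 16):
        super().__init__()
        self.proj = nn.Linear(hidden_size, num_quantiles)
        # Midpoint quantile levels tau_k in (0, 1).
        taus = (torch.arange(num_quantiles) + 0.5) / num_quantiles
        self.register_buffer("taus", taus)

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # (batch, seq_len, hidden) -> (batch, seq_len, K) token-level quantiles
        return self.proj(hidden_states)


def quantile_loss(pred, returns, taus):
    """Pinball loss fitting each predicted quantile to the observed return."""
    diff = returns.unsqueeze(-1) - pred                 # (B, T, K)
    # Under-predictions are weighted by tau, over-predictions by (1 - tau).
    return torch.maximum(taus * diff, (taus - 1.0) * diff).mean()


def asymmetric_risk_penalty(pred, lam_low=1.0, lam_high=0.1):
    """Assumed regularizer: penalize lower-tail spread more than upper-tail."""
    median = pred.median(dim=-1, keepdim=True).values
    below = (median - pred).clamp(min=0.0)              # spread below the median
    above = (pred - median).clamp(min=0.0)              # spread above the median
    return (lam_low * below + lam_high * above).mean()


def cvar_from_quantiles(pred, taus, alpha=0.25):
    """Lower-tail CVaR estimate: mean of quantiles at levels tau <= alpha."""
    return pred[..., taus <= alpha].mean(dim=-1)        # (B, T) risk-aware values


# Toy usage with stand-in tensors in place of real LLM hidden states.
head = DistributionalValueHead(hidden_size=768)
h = torch.randn(2, 10, 768)
q = head(h)                                             # (2, 10, 16)
returns = torch.randn(2, 10)
loss = quantile_loss(q, returns, head.taus) + 0.5 * asymmetric_risk_penalty(q)
loss.backward()
```

The sketch shows why token-level distributions give finer supervision than a scalar critic: each position carries a full quantile profile that a risk functional can then shape asymmetrically, rather than a single mean estimate.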
Key facts
- DVPO stands for Distributional Value Modeling with Risk-aware Policy Optimization
- The framework targets LLM post-training with noisy or incomplete supervision
- It combines conditional risk theory with distributional value modeling
- Token-level value distributions provide fine-grained supervision
- Asymmetric risk regularization is applied to shape the distribution
- Existing methods are cited as less effective: worst-case optimization (RFQI, CQL) can be overly conservative, while mean-based approaches (PPO, GRPO) can perform unevenly (see the policy-update sketch after this list)
- The paper is on arXiv with ID 2512.03847
- The announcement type is replace-cross
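To indicate where a risk-aware value estimate would enter the policy-optimization half of the method, here is a minimal PPO-style clipped update in which the usual mean-value baseline would be swapped for a lower-tail CVaR baseline such as cvar_from_quantiles from the earlier sketch. This pairing, and the function names, are assumptions for illustration, not DVPO's published objective.

```python
# Hedged sketch: a standard PPO clipped surrogate with a risk-aware baseline;
# DVPO's actual policy objective may differ from this illustration.
import torch


def clipped_policy_loss(logp_new, logp_old, advantages, clip_eps=0.2):
    """PPO clipped surrogate over token-level log-probabilities."""
    ratio = torch.exp(logp_new - logp_old)                    # (B, T)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * advantages
    return -torch.minimum(unclipped, clipped).mean()


# Toy usage; in DVPO-like training the advantages would come from returns
# minus a risk-aware baseline (e.g. a CVaR estimate), not random stand-ins.
logp_new = torch.randn(2, 10, requires_grad=True)
logp_old = logp_new.detach() + 0.01 * torch.randn(2, 10)
advantages = torch.randn(2, 10)                               # stand-in values
loss = clipped_policy_loss(logp_new, logp_old, advantages)
loss.backward()
```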