DynaMO: A New Framework for Reinforcement Learning with Verifiable Rewards
A research paper on arXiv (2602.19208) proposes DynaMO, a dual-pronged optimization framework for Reinforcement Learning with Verifiable Rewards (RLVR) in Large Language Model (LLM) reasoning. The framework addresses two key challenges: uniform rollout allocation, which ignores the heterogeneity of gradient variance across problems, and gradient attenuation for high-confidence correct actions, which arises from the softmax policy structure. At the sequence level, DynaMO derives a variance-minimizing rollout allocation from first principles, using Bernoulli variance as a proxy for gradient informativeness, and proves that uniform allocation is suboptimal. At the token level, it develops gradient-aware advantage modulation based on a theoretical analysis of gradient-magnitude bounds.
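The sequence-level idea can be illustrated with a minimal sketch. The allocation rule below is an assumption, not the paper's exact derivation: it treats each problem's per-rollout reward as a Bernoulli variable with estimated success rate `p`, uses the Bernoulli variance `p * (1 - p)` as the informativeness proxy, and distributes the rollout budget Neyman-style, proportionally to each problem's standard deviation. The function name `allocate_rollouts` and the `min_per_problem` floor are hypothetical choices for this sketch.

```python
import math

def allocate_rollouts(success_rates, total_rollouts, min_per_problem=1):
    """Sketch of variance-aware rollout allocation (illustrative only).

    Each problem's estimated success rate p gives a Bernoulli variance
    p * (1 - p), used as a proxy for gradient informativeness. Budget is
    split proportionally to the standard deviation sqrt(p * (1 - p)),
    the classic Neyman allocation; the paper's exact rule may differ.
    """
    stds = [math.sqrt(p * (1.0 - p)) for p in success_rates]
    total_std = sum(stds)
    if total_std == 0.0:
        # Every problem is fully solved or fully failed: no variance
        # signal to exploit, so fall back to uniform allocation.
        return [total_rollouts // len(success_rates)] * len(success_rates)
    raw = [total_rollouts * s / total_std for s in stds]
    # Round, but keep a floor so no problem is starved of rollouts.
    return [max(min_per_problem, int(round(r))) for r in raw]
```

Under this rule a problem near p = 0.5 (maximal Bernoulli variance) receives the largest share, while nearly solved or nearly hopeless problems receive only the floor, which is the sense in which uniform allocation wastes budget.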
Key facts
- Paper is on arXiv with ID 2602.19208
- Proposes DynaMO framework for RLVR
- Addresses uniform rollout allocation, which ignores gradient-variance heterogeneity across problems
- Addresses gradient attenuation for high-confidence correct actions under softmax policies
- Sequence-level variance-minimizing allocation
- Uses Bernoulli variance as a proxy for gradient informativeness
- Token-level gradient-aware advantage modulation
- Proves uniform allocation is suboptimal
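The token-level mechanism can also be sketched. For a softmax policy, the gradient of log pi(a|s) with respect to the chosen token's logit is (1 - p), so updates for high-confidence correct tokens (p near 1) are attenuated toward zero. The rescaling rule below is one plausible counter-measure assumed for illustration, not the paper's stated modulation; the function name `modulated_advantages` and the `eps` floor are hypothetical.

```python
def modulated_advantages(token_probs, advantages, eps=0.1):
    """Sketch of gradient-aware advantage modulation (illustrative only).

    The softmax log-prob gradient for the chosen token scales with
    (1 - p), vanishing as confidence p -> 1. One plausible fix, NOT
    necessarily the paper's exact rule, rescales positive advantages
    by 1 / max(1 - p, eps) to compensate for that attenuation.
    """
    out = []
    for p, a in zip(token_probs, advantages):
        attenuation = max(1.0 - p, eps)  # (1 - p) factor, floored at eps
        # Boost only reinforced (positive-advantage) tokens; leave
        # negative advantages untouched in this sketch.
        out.append(a / attenuation if a > 0 else a)
    return out
```

For example, a correct token at p = 0.99 would otherwise receive roughly a fiftieth of the gradient magnitude of one at p = 0.5; the rescaling restores comparable effective step sizes, capped by the eps floor.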
Entities
Institutions
- arXiv