DynaMO: A New Framework for Reinforcement Learning with Verifiable Rewards
A research paper on arXiv (2602.19208) proposes DynaMO, a dual-pronged optimization framework for Reinforcement Learning with Verifiable Rewards (RLVR) in Large Language Model (LLM) reasoning. The framework addresses two key challenges: uniform rollout allocation, which ignores the heterogeneity of gradient variance across problems, and gradient attenuation for high-confidence correct actions, which arises from the softmax policy structure. At the sequence level, DynaMO derives a variance-minimizing rollout allocation from first principles, using Bernoulli variance as a proxy for gradient informativeness, and proves that uniform allocation is suboptimal. At the token level, it develops gradient-aware advantage modulation based on a theoretical analysis of gradient-magnitude bounds.
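The sequence-level idea can be illustrated with a minimal sketch. The allocation rule below is an assumption, not the paper's exact derivation: it treats each problem's per-rollout reward as a Bernoulli variable with estimated success rate `p`, uses the Bernoulli variance `p * (1 - p)` as the informativeness proxy, and distributes the rollout budget Neyman-style, proportionally to each problem's standard deviation. The function name `allocate_rollouts` and the `min_per_problem` floor are hypothetical choices for this sketch.

```python
import math

def allocate_rollouts(success_rates, total_rollouts, min_per_problem=1):
    """Sketch of variance-aware rollout allocation (illustrative only).

    Each problem's estimated success rate p gives a Bernoulli variance
    p * (1 - p), used as a proxy for gradient informativeness. Budget is
    split proportionally to the standard deviation sqrt(p * (1 - p)),
    the classic Neyman allocation; the paper's exact rule may differ.
    """
    stds = [math.sqrt(p * (1.0 - p)) for p in success_rates]
    total_std = sum(stds)
    if total_std == 0.0:
        # Every problem is fully solved or fully failed: no variance
        # signal to exploit, so fall back to uniform allocation.
        return [total_rollouts // len(success_rates)] * len(success_rates)
    raw = [total_rollouts * s / total_std for s in stds]
    # Round, but keep a floor so no problem is starved of rollouts.
    return [max(min_per_problem, int(round(r))) for r in raw]
```

Under this rule a problem near p = 0.5 (maximal Bernoulli variance) receives the largest share, while nearly solved or nearly hopeless problems receive only the floor, which is the sense in which uniform allocation wastes budget.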
Key facts
- Paper is on arXiv with ID 2602.19208
- Proposes DynaMO framework for RLVR
- Addresses uniform rollout allocation, which ignores gradient-variance heterogeneity across problems
- Addresses gradient attenuation for high-confidence correct actions under softmax policies
- Sequence-level variance-minimizing allocation
- Uses Bernoulli variance as a proxy for gradient informativeness
- Token-level gradient-aware advantage modulation
- Proves uniform allocation is suboptimal
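The token-level mechanism can also be sketched. For a softmax policy, the gradient of log pi(a|s) with respect to the chosen token's logit is (1 - p), so updates for high-confidence correct tokens (p near 1) are attenuated toward zero. The rescaling rule below is one plausible counter-measure assumed for illustration, not the paper's stated modulation; the function name `modulated_advantages` and the `eps` floor are hypothetical.

```python
def modulated_advantages(token_probs, advantages, eps=0.1):
    """Sketch of gradient-aware advantage modulation (illustrative only).

    The softmax log-prob gradient for the chosen token scales with
    (1 - p), vanishing as confidence p -> 1. One plausible fix, NOT
    necessarily the paper's exact rule, rescales positive advantages
    by 1 / max(1 - p, eps) to compensate for that attenuation.
    """
    out = []
    for p, a in zip(token_probs, advantages):
        attenuation = max(1.0 - p, eps)  # (1 - p) factor, floored at eps
        # Boost only reinforced (positive-advantage) tokens; leave
        # negative advantages untouched in this sketch.
        out.append(a / attenuation if a > 0 else a)
    return out
```

For example, a correct token at p = 0.99 would otherwise receive roughly a fiftieth of the gradient magnitude of one at p = 0.5; the rescaling restores comparable effective step sizes, capped by the eps floor.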
Entities
Institutions
- arXiv