DWD Phenomenon Enables Efficient Gradient Reuse in RLVR for LLMs

ai-technology · 2026-05-20

Researchers have identified a phenomenon they call Disproportionate Weight Divergence (DWD), which makes reinforcement learning with verifiable rewards (RLVR) more sample-efficient for large language models (LLMs). RLVR is a dominant paradigm for advanced reasoning, but it is costly because each rollout batch is expensive to generate. Reusing a batch for multiple gradient updates, a common practice in classical RL, amplifies policy shift in RLVR and degrades performance. The DWD phenomenon shows that this degradation coincides with a sharp surge in lm_head weight change while intermediate layers remain stable. Tracking that surge makes it possible to detect early when sample reuse should stop. The finding is verified empirically across diverse LLMs and tasks, and is backed by a theoretical proof that harmful gradients concentrate at the lm_head. The work addresses a critical bottleneck in RLVR and makes it more practical for training advanced reasoning models.

Key facts

RLVR is a dominant paradigm for advanced reasoning in LLMs.
Rollout samples are expensive, making sample efficiency critical.
Reusing rollout batches for multiple gradient updates amplifies policy shift in RLVR.
DWD stands for Disproportionate Weight Divergence.
Performance degradation coincides with a sharp surge in lm_head weight change.
Intermediate layers remain stable during DWD.
DWD emerges consistently across diverse LLMs and tasks.
A theoretical proof shows that harmful gradients concentrate at the lm_head.
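
The detection idea in the facts above can be illustrated with a minimal sketch. This is not the authors' method: the summary does not say how weight change is measured or what threshold is used. The sketch assumes a relative L2 change against a snapshot taken before batch reuse begins, and a hypothetical ratio threshold between lm_head drift and the mean drift of the other layers. The function names, the layer names, and the default value of `ratio_threshold` are all illustrative.

```python
import math


def relative_change(current, reference):
    """Relative L2 change ||current - reference|| / ||reference|| of a flat weight list."""
    diff = math.sqrt(sum((c - r) ** 2 for c, r in zip(current, reference)))
    norm = math.sqrt(sum(r ** 2 for r in reference))
    return diff / norm if norm > 0 else 0.0


def should_stop_reuse(weights, reference, ratio_threshold=5.0, eps=1e-12):
    """Return True when lm_head drift is disproportionate to the other layers.

    `weights` and `reference` map layer names to flat lists of floats.
    `ratio_threshold` is a hypothetical tuning knob, not a value from the source.
    """
    head = relative_change(weights["lm_head"], reference["lm_head"])
    others = [
        relative_change(w, reference[name])
        for name, w in weights.items()
        if name != "lm_head"
    ]
    mean_other = sum(others) / len(others) if others else 0.0
    return head / (mean_other + eps) > ratio_threshold


# Toy snapshot taken before the first reuse step (assumed layer names).
reference = {"lm_head": [1.0, 0.0], "layer_0": [1.0, 0.0], "layer_1": [0.0, 1.0]}

# All layers drift by about 1%: no disproportionate divergence.
stable = {"lm_head": [1.01, 0.0], "layer_0": [1.01, 0.0], "layer_1": [0.0, 1.01]}

# lm_head drifts by 50% while the other layers stay near 1%.
surged = {"lm_head": [1.5, 0.0], "layer_0": [1.01, 0.0], "layer_1": [0.0, 1.01]}

stop_on_stable = should_stop_reuse(stable, reference)
stop_on_surged = should_stop_reuse(surged, reference)
```

In a real training loop, a check like this would presumably run after each extra gradient step on a reused batch, with a fresh rollout batch sampled once it returns True. How often to check and which layers to compare against are left open here.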

Sources

arXiv cs.AI — 2026-05-20