DACA-GRPO Enhances Reinforcement Learning for Diffusion Language Models
A new paper on arXiv (2605.16342) proposes DACA-GRPO (Denoising-Aware Credit Assignment for GRPO), a method to improve reinforcement learning in diffusion large language models. The authors identify two weaknesses in existing RL approaches: lack of temporal credit assignment across denoising steps and biased mean-field likelihood estimates. DACA-GRPO introduces Denoising Progress Scores for per-token importance weights and Stratified Masking Likelihood to reduce bias. It is designed as a plug-and-play enhancement for GRPO-style trainers.
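To make the per-token weighting idea concrete, here is a minimal sketch of how importance weights could be folded into a GRPO-style advantage. It is an illustration under stated assumptions, not the paper's formulation: the function names, the mean-1 normalization, and the way weights are supplied are all hypothetical, and the computation of the Denoising Progress Scores themselves is not shown because the summary does not describe it.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantage: standardize each reward within its sampling group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def token_weighted_advantages(rewards, token_weights):
    """Spread each completion's advantage over its tokens.

    token_weights[i] is a hypothetical list of non-negative importance
    scores for completion i (a stand-in for per-token scores). Weights are
    rescaled to average 1, so uniform weights give every token the plain
    sequence-level advantage.
    """
    advantages = group_relative_advantages(rewards)
    weighted = []
    for adv, weights in zip(advantages, token_weights):
        n = len(weights)
        total = sum(weights)
        if total <= 0:
            # Assumption: fall back to uniform weighting when scores vanish.
            normalized = [1.0] * n
        else:
            normalized = [w * n / total for w in weights]
        weighted.append([adv * w for w in normalized])
    return weighted


# Two completions in one group; the second has uneven token importance.
result = token_weighted_advantages([1.0, 0.0], [[1.0, 1.0], [1.0, 3.0]])
```

With uniform weights every token receives the same advantage, which corresponds to treating all tokens as equally important; uneven weights shift credit toward the tokens scored as more important.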
Key facts
- arXiv paper 2605.16342 introduces DACA-GRPO
- DACA-GRPO addresses temporal credit assignment in diffusion LLMs
- Denoising Progress Scores extract per-token importance weights
- Stratified Masking Likelihood partitions token positions into strata
- Method is a plug-and-play enhancement for GRPO-style trainers
- Existing RL methods treat all denoising steps as equally important
- Mean-field likelihood estimates are systematically biased
- DACA-GRPO requires no additional forward-pass cost
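The stratum partitioning mentioned in the list above can be sketched as follows. This is one plausible reading, not the paper's algorithm: it assumes the strata are disjoint, cover every position, and are masked one at a time so that each token is scored exactly once. The helper names and the `token_logprob` callback (a stand-in for a model call) are hypothetical, and the sketch makes no claim about how the method keeps its compute cost down.

```python
import random


def stratified_masks(seq_len, num_strata, seed=0):
    """Partition positions 0..seq_len-1 into disjoint strata of near-equal size."""
    rng = random.Random(seed)
    positions = list(range(seq_len))
    rng.shuffle(positions)
    # Round-robin split of the shuffled positions: every position lands in
    # exactly one stratum.
    return [sorted(positions[k::num_strata]) for k in range(num_strata)]


def stratified_log_likelihood(seq_len, num_strata, token_logprob, seed=0):
    """Sum token log-probs, scoring each position while its stratum is masked.

    token_logprob(position, masked_positions) is a hypothetical callback
    that returns the model's log-probability for the true token at
    `position` when `masked_positions` are hidden.
    """
    total = 0.0
    for stratum in stratified_masks(seq_len, num_strata, seed):
        masked = frozenset(stratum)
        total += sum(token_logprob(p, masked) for p in stratum)
    return total


strata = stratified_masks(10, 3)
estimate = stratified_log_likelihood(10, 3, lambda p, masked: -1.0)
```

Unlike drawing independent random masks, a partition guarantees that no position is skipped or counted twice, which is the kind of coverage property a stratified estimator is usually chosen for.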
Entities
Institutions
- arXiv