DERL: Differentiable Evolutionary Reinforcement Learning for Reward Optimization
Differentiable Evolutionary Reinforcement Learning (DERL) is a new framework that addresses the challenge of reward-signal design in reinforcement learning. DERL uses a bi-level structure: an outer-loop Meta-Optimizer evolves reward functions by composing atomic primitives, and it is made differentiable by updating the Meta-Optimizer with policy gradients derived from inner-loop validation performance. This contrasts with prior methods that treat reward functions as non-differentiable black boxes and rely on derivative-free search. By coupling the two loops through gradients, the approach aims to exploit the causal link between reward modifications and policy outcomes on complex reasoning tasks.
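The bi-level mechanism can be sketched in a toy form. Everything below is an illustrative assumption rather than the paper's implementation: the two reward primitives, the 1-D policy, the validation target of 0.5, and an antithetic score-function estimator standing in for DERL's policy-gradient meta-update.

```python
import numpy as np

rng = np.random.default_rng(0)

# Atomic reward primitives (hypothetical stand-ins for the paper's primitives):
# a linear "progress" term and a quadratic "effort" penalty on a scalar action.
PRIMITIVES = [lambda a: a, lambda a: -a ** 2]

def composed_reward(weights, a):
    """Candidate reward = weighted composition of atomic primitives."""
    return sum(w * p(a) for w, p in zip(weights, PRIMITIVES))

def inner_loop(weights, steps=200, lr=0.1):
    """Inner loop: train a 1-D policy (a single action parameter theta)
    to maximize the candidate reward by gradient ascent."""
    theta = 0.0
    h = 1e-4  # finite-difference step for the reward gradient w.r.t. the action
    for _ in range(steps):
        g = (composed_reward(weights, theta + h)
             - composed_reward(weights, theta - h)) / (2 * h)
        theta = float(np.clip(theta + lr * g, -5.0, 5.0))  # clip for stability
    return theta

def validation_return(theta):
    """Held-out metric: proximity of the learned action to an assumed
    target of 0.5 on the validation task."""
    return -(theta - 0.5) ** 2

# Outer loop: the Meta-Optimizer perturbs the reward weights and updates them
# with an antithetic score-function (policy-gradient-style) estimate of the
# gradient of validation performance, rather than derivative-free hill climbing.
weights = np.array([0.0, 1.0])
meta_lr, sigma = 0.3, 0.1
for _ in range(30):
    eps = sigma * rng.normal(size=weights.shape)
    f_plus = validation_return(inner_loop(weights + eps))
    f_minus = validation_return(inner_loop(weights - eps))
    weights = weights + meta_lr * (f_plus - f_minus) / (2 * sigma ** 2) * eps
    weights[1] = max(weights[1], 0.2)  # keep the effort penalty positive (stabilizer)
```

Under these assumptions, the meta-update should steer the inner-loop optimum (w1 / (2 * w2) for this reward family) toward the validation target, illustrating how gradients from validation performance can shape an evolved reward rather than treating it as a black box.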
Key facts
- DERL stands for Differentiable Evolutionary Reinforcement Learning
- It is a bi-level framework for autonomous discovery of optimal reward structures
- The Meta-Optimizer evolves reward functions through composition of atomic primitives
- Differentiability is introduced by updating the Meta-Optimizer using policy gradients
- Gradients are derived from inner-loop validation performance
- Prior methods treat reward functions as black boxes using derivative-free search
- The framework targets complex reasoning tasks in reinforcement learning
- The paper is available on arXiv with ID 2512.13399