DERL: Differentiable Evolutionary Reinforcement Learning for Reward Optimization
Differentiable Evolutionary Reinforcement Learning (DERL) is a new framework that addresses the challenge of reward-signal design in reinforcement learning. DERL uses a bi-level structure: an outer-loop Meta-Optimizer evolves reward functions by composing atomic primitives, and it is made differentiable by updating the Meta-Optimizer with policy gradients derived from inner-loop validation performance. This contrasts with prior methods that treat reward functions as non-differentiable black boxes and rely on derivative-free search. By coupling the two loops through gradients, the approach aims to exploit the causal link between reward modifications and policy outcomes on complex reasoning tasks.
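The bi-level mechanism can be sketched in a toy form. Everything below is an illustrative assumption rather than the paper's implementation: the two reward primitives, the 1-D policy, the validation target of 0.5, and an antithetic score-function estimator standing in for DERL's policy-gradient meta-update.

```python
import numpy as np

rng = np.random.default_rng(0)

# Atomic reward primitives (hypothetical stand-ins for the paper's primitives):
# a linear "progress" term and a quadratic "effort" penalty on a scalar action.
PRIMITIVES = [lambda a: a, lambda a: -a ** 2]

def composed_reward(weights, a):
    """Candidate reward = weighted composition of atomic primitives."""
    return sum(w * p(a) for w, p in zip(weights, PRIMITIVES))

def inner_loop(weights, steps=200, lr=0.1):
    """Inner loop: train a 1-D policy (a single action parameter theta)
    to maximize the candidate reward by gradient ascent."""
    theta = 0.0
    h = 1e-4  # finite-difference step for the reward gradient w.r.t. the action
    for _ in range(steps):
        g = (composed_reward(weights, theta + h)
             - composed_reward(weights, theta - h)) / (2 * h)
        theta = float(np.clip(theta + lr * g, -5.0, 5.0))  # clip for stability
    return theta

def validation_return(theta):
    """Held-out metric: proximity of the learned action to an assumed
    target of 0.5 on the validation task."""
    return -(theta - 0.5) ** 2

# Outer loop: the Meta-Optimizer perturbs the reward weights and updates them
# with an antithetic score-function (policy-gradient-style) estimate of the
# gradient of validation performance, rather than derivative-free hill climbing.
weights = np.array([0.0, 1.0])
meta_lr, sigma = 0.3, 0.1
for _ in range(30):
    eps = sigma * rng.normal(size=weights.shape)
    f_plus = validation_return(inner_loop(weights + eps))
    f_minus = validation_return(inner_loop(weights - eps))
    weights = weights + meta_lr * (f_plus - f_minus) / (2 * sigma ** 2) * eps
    weights[1] = max(weights[1], 0.2)  # keep the effort penalty positive (stabilizer)
```

Under these assumptions, the meta-update should steer the inner-loop optimum (w1 / (2 * w2) for this reward family) toward the validation target, illustrating how gradients from validation performance can shape an evolved reward rather than treating it as a black box.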
Key facts
- DERL stands for Differentiable Evolutionary Reinforcement Learning
- It is a bi-level framework for autonomous discovery of optimal reward structures
- The Meta-Optimizer evolves reward functions through composition of atomic primitives
- Differentiability is introduced by updating the Meta-Optimizer using policy gradients
- Gradients are derived from inner-loop validation performance
- Prior methods treat reward functions as black boxes using derivative-free search
- The framework targets complex reasoning tasks in reinforcement learning
- The paper is available on arXiv with ID 2512.13399