ARTFEED — Contemporary Art Intelligence

ROAD: Adaptive Data Mixing for Offline-to-Online RL via Bi-Level Optimization

other · 2026-05-16

The newly introduced framework, ROAD (Reinforcement Learning with Optimized Adaptive Data-mixing), tackles distribution shift in offline-to-online reinforcement learning. By framing data selection as a bi-level optimization problem, ROAD automates data replay, treating the mixing strategy as a meta-decision that governs policy performance. This addresses the core objective misalignment in current techniques, which rely on fixed mixing ratios or heuristic replay strategies and therefore cannot adapt to varying environments and training dynamics. ROAD functions as a dynamic plug-and-play framework aimed at improving the tradeoff between training stability and long-term performance.
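A bi-level formulation of this kind can be sketched as follows; the notation here is illustrative, not taken from the paper. The outer level chooses a mixing weight $\beta$ over the offline and online data distributions to maximize the return of the policy that the inner level produces by training on the mixed replay distribution:

```latex
\begin{aligned}
\max_{\beta \in [0,1]} \quad & J\!\left(\pi^{*}_{\beta}\right) \\
\text{s.t.} \quad & \pi^{*}_{\beta} \;=\; \arg\max_{\pi}\;
\mathbb{E}_{(s,a)\,\sim\, \beta\, \mathcal{D}_{\text{off}} + (1-\beta)\, \mathcal{D}_{\text{on}}}
\left[\, Q^{\pi}(s,a) \,\right]
\end{aligned}
```

The inner problem is ordinary RL training on the mixture; the outer problem makes the mixture itself a learned decision rather than a fixed hyperparameter, which is what distinguishes this setup from static-ratio replay.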

Key facts

  • ROAD is a framework for offline-to-online reinforcement learning.
  • It addresses non-stationary distribution shift between the offline dataset and the evolving online policy.
  • Existing approaches use static mixing ratios or heuristic-based replay strategies.
  • ROAD formulates data selection as a bi-level optimization process.
  • The data mixing strategy is interpreted as a meta-decision governing policy performance.
  • ROAD is a dynamic plug-and-play framework.
  • It aims to improve the tradeoff between stability and asymptotic performance.
  • The framework automates the data replay process.
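The adaptive-mixing idea above can be illustrated with a toy sketch. This is a hypothetical stand-in, not ROAD's actual algorithm: `beta` is the probability of sampling from the offline dataset, and a simple meta-signal (recent online return versus a running baseline) nudges `beta` toward online data when performance improves and back toward offline data when it degrades.

```python
import random


class AdaptiveMixer:
    """Toy offline/online replay mixer (illustrative, not ROAD's method).

    beta is the probability of drawing a transition from the offline
    dataset; 1 - beta is the probability of drawing from the online buffer.
    """

    def __init__(self, beta=0.8, lr=0.05):
        self.beta = beta          # current mixing ratio (meta-decision)
        self.lr = lr              # step size for the meta-update
        self.baseline = None      # running baseline of online returns

    def sample_batch(self, offline, online, batch_size, rng):
        """Draw a mixed batch according to the current beta."""
        batch = []
        for _ in range(batch_size):
            pool = offline if rng.random() < self.beta else online
            batch.append(rng.choice(pool))
        return batch

    def update(self, mean_return):
        """Adjust beta from a crude meta-signal (hypothetical rule)."""
        if self.baseline is None:
            self.baseline = mean_return
            return self.beta
        advantage = mean_return - self.baseline
        # Improving returns -> lean more on fresh online data (lower beta);
        # degrading returns -> fall back on offline data for stability.
        self.beta = min(1.0, max(0.0, self.beta - self.lr * advantage))
        self.baseline = 0.9 * self.baseline + 0.1 * mean_return
        return self.beta


# Usage sketch with placeholder data:
mixer = AdaptiveMixer(beta=0.8, lr=0.05)
offline = [("off", i) for i in range(5)]
online = [("on", i) for i in range(5)]
batch = mixer.sample_batch(offline, online, 10, random.Random(0))
mixer.update(1.0)          # first call just sets the baseline
new_beta = mixer.update(2.0)  # returns improved -> beta decreases
```

A real bi-level method would replace the heuristic `update` rule with an outer optimization over policy performance; the sketch only shows where the meta-decision plugs into replay sampling.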
