Q2RL: Extracting Q-Values from Behavior Cloning for Robot Learning
Behavior Cloning (BC) is effective for robot learning but offers no mechanism for online improvement once the demonstrations are exhausted. Existing offline-to-online methods suffer from distribution mismatch between the offline data and online interaction. Q2RL (Q-Estimation and Q-Gating from BC for Reinforcement Learning) addresses this by extracting a Q-function from a trained BC policy using only a few environment interactions, then applying Q-Gating to switch between BC and RL actions based on their estimated Q-values. Q2RL outperforms state-of-the-art baselines on D4RL and robomimic manipulation tasks.
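The Q-Estimation idea, evaluating the BC policy's Q-function from a small number of online rollouts, can be sketched in tabular form as below. This is an illustrative SARSA-style policy-evaluation sketch under assumed names (`bc_policy`, a Gymnasium-style `env`), not the paper's actual implementation:

```python
from collections import defaultdict

def estimate_q_from_bc(env, bc_policy, episodes=10, gamma=0.99, alpha=0.1):
    """Sketch of Q-Estimation: evaluate the BC policy's Q-values from a
    few online rollouts (SARSA-style TD policy evaluation).

    Assumes env follows the Gymnasium reset/step API and states/actions
    are hashable; all names here are hypothetical.
    """
    Q = defaultdict(float)  # Q[(state, action)] -> estimated return
    for _ in range(episodes):
        s, _ = env.reset()
        a = bc_policy(s)
        done = False
        while not done:
            s2, r, terminated, truncated, _ = env.step(a)
            done = terminated or truncated
            a2 = bc_policy(s2)
            # TD target bootstraps on the BC policy's next action.
            target = r + (0.0 if done else gamma * Q[(s2, a2)])
            Q[(s, a)] += alpha * (target - Q[(s, a)])
            s, a = s2, a2
    return Q
```

In practice a deep Q-network would replace the table, but the key property is the same: the Q-function is fitted to the BC policy's own behavior, so it needs only a few interaction steps.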
Key facts
- Behavior Cloning lacks self-guided online improvement.
- Distribution mismatch between offline data and online interaction degrades existing methods, which replace the BC policy outright.
- Q2RL consists of Q-Estimation and Q-Gating.
- Q-Estimation extracts Q-function from BC policy using few interaction steps.
- Q-Gating switches between BC and RL actions based on Q-values.
- Evaluated on D4RL and robomimic benchmarks.
- Outperforms SOTA offline-to-online learning baselines.
- Published on arXiv with ID 2605.05172.
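The Q-Gating step listed above can be sketched as a simple comparison: at each state, execute whichever of the BC and RL actions the estimated Q-function scores higher. The function and policy names below are illustrative assumptions, not the paper's API:

```python
def q_gating_action(state, bc_policy, rl_policy, q_fn):
    """Sketch of Q-Gating: choose between the BC and RL actions
    by comparing their estimated Q-values.

    bc_policy, rl_policy: callables state -> action (hypothetical names).
    q_fn: callable (state, action) -> scalar Q-value estimate,
    e.g. one obtained via Q-Estimation.
    """
    a_bc = bc_policy(state)
    a_rl = rl_policy(state)
    # Gate: fall back to the BC action unless the RL action
    # is estimated to be at least as good.
    return a_rl if q_fn(state, a_rl) >= q_fn(state, a_bc) else a_bc
```

Because the gate defaults to the demonstrated behavior, the RL policy's actions are only executed where the Q-function predicts they improve on BC, which limits the impact of distribution mismatch during online learning.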
Entities
Institutions
- arXiv