Q2RL: Extracting Q-Values from Behavior Cloning for Robot Learning
Behavior Cloning (BC) is effective for robot learning but offers no mechanism for online improvement once the demonstrations are exhausted. Existing offline-to-online methods suffer from distribution mismatch between the offline data and online interaction. Q2RL (Q-Estimation and Q-Gating from BC for Reinforcement Learning) addresses this by extracting a Q-function from a trained BC policy using only a few environment interactions, then applying Q-Gating to switch between BC and RL actions based on their estimated Q-values. Q2RL outperforms state-of-the-art baselines on D4RL and robomimic manipulation tasks.
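The Q-Estimation idea, evaluating the BC policy's Q-function from a small number of online rollouts, can be sketched in tabular form as below. This is an illustrative SARSA-style policy-evaluation sketch under assumed names (`bc_policy`, a Gymnasium-style `env`), not the paper's actual implementation:

```python
from collections import defaultdict

def estimate_q_from_bc(env, bc_policy, episodes=10, gamma=0.99, alpha=0.1):
    """Sketch of Q-Estimation: evaluate the BC policy's Q-values from a
    few online rollouts (SARSA-style TD policy evaluation).

    Assumes env follows the Gymnasium reset/step API and states/actions
    are hashable; all names here are hypothetical.
    """
    Q = defaultdict(float)  # Q[(state, action)] -> estimated return
    for _ in range(episodes):
        s, _ = env.reset()
        a = bc_policy(s)
        done = False
        while not done:
            s2, r, terminated, truncated, _ = env.step(a)
            done = terminated or truncated
            a2 = bc_policy(s2)
            # TD target bootstraps on the BC policy's next action.
            target = r + (0.0 if done else gamma * Q[(s2, a2)])
            Q[(s, a)] += alpha * (target - Q[(s, a)])
            s, a = s2, a2
    return Q
```

In practice a deep Q-network would replace the table, but the key property is the same: the Q-function is fitted to the BC policy's own behavior, so it needs only a few interaction steps.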
Key facts
- Behavior Cloning lacks self-guided online improvement.
- Distribution mismatch between offline data and online interaction degrades existing methods, which replace the BC policy outright.
- Q2RL consists of Q-Estimation and Q-Gating.
- Q-Estimation extracts Q-function from BC policy using few interaction steps.
- Q-Gating switches between BC and RL actions based on Q-values.
- Evaluated on D4RL and robomimic benchmarks.
- Outperforms SOTA offline-to-online learning baselines.
- Published on arXiv with ID 2605.05172.
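The Q-Gating step listed above can be sketched as a simple comparison: at each state, execute whichever of the BC and RL actions the estimated Q-function scores higher. The function and policy names below are illustrative assumptions, not the paper's API:

```python
def q_gating_action(state, bc_policy, rl_policy, q_fn):
    """Sketch of Q-Gating: choose between the BC and RL actions
    by comparing their estimated Q-values.

    bc_policy, rl_policy: callables state -> action (hypothetical names).
    q_fn: callable (state, action) -> scalar Q-value estimate,
    e.g. one obtained via Q-Estimation.
    """
    a_bc = bc_policy(state)
    a_rl = rl_policy(state)
    # Gate: fall back to the BC action unless the RL action
    # is estimated to be at least as good.
    return a_rl if q_fn(state, a_rl) >= q_fn(state, a_bc) else a_bc
```

Because the gate defaults to the demonstrated behavior, the RL policy's actions are only executed where the Q-function predicts they improve on BC, which limits the impact of distribution mismatch during online learning.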
Entities
Institutions
- arXiv