ARTFEED — Contemporary Art Intelligence

Benchmarking Expert-Guided RL Reveals Three Failure Modes

other · 2026-05-12

A recent study standardizes the evaluation of query-time expert-guided reinforcement learning methods on a shared SAC backbone, with unified hyperparameter optimization (HPO) and evaluation protocols. It runs 100/50 seeds per environment-method pair and a degradation sweep over expert undertuning, action bias, and observation noise. The results surface three failure modes that individual-paper evaluations miss: a critic blind spot under argmax-plus-bootstrap that leaves IBRL below no-expert SAC when the expert is near the no-expert-RL ceiling; residual saturation with suboptimal experts; and warm-start buffer poisoning that breaks training-time-handoff methods under deployment-time conditions. The full paper is available on arXiv.
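To make the critic-blind-spot mechanism concrete, here is a minimal sketch of an argmax-plus-bootstrap update in the style the summary describes. This is not the paper's implementation: `bootstrap_target` and `q_stale` are hypothetical names, and the setup assumes the common IBRL-style scheme where the agent picks between the RL policy's action and the expert's action using its own Q-estimates, then bootstraps from the chosen branch.

```python
import numpy as np

def bootstrap_target(q, next_state, policy_action, expert_action,
                     reward, gamma=0.99):
    """Pick between the two action proposals by current Q-estimates,
    then bootstrap from the chosen branch (argmax-plus-bootstrap)."""
    candidates = [policy_action, expert_action]
    q_values = [q(next_state, a) for a in candidates]
    chosen = int(np.argmax(q_values))   # argmax over the two proposals
    return reward + gamma * q_values[chosen], candidates[chosen]

# A stale critic that overrates the RL policy's action: the near-ceiling
# expert is never selected, so its Q-estimate is never corrected by real
# returns -- the blind spot.
q_stale = lambda s, a: 1.0 if a == "rl" else 0.5
target, picked = bootstrap_target(q_stale, None, "rl", "expert", reward=0.0)
```

Because only the chosen branch is ever evaluated against real returns, a critic that initially underrates a strong expert can keep ignoring it indefinitely, which is consistent with IBRL underperforming no-expert SAC on near-ceiling experts.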

Key facts

  • arXiv:2605.09109v1
  • Published on arXiv
  • Compares expert-guided RL methods on shared SAC backbone
  • Uses 100/50 seeds per (env, method)
  • Degradation sweep over expert undertuning, action bias, observation noise
  • Identifies three failure modes: critic blind spot, residual saturation, buffer poisoning
  • IBRL performs worse than no-expert SAC on near-ceiling experts
  • Training-time-handoff methods collapse under deployment-time conditions
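The residual-saturation failure listed above can be shown with a toy calculation. The function name and the 0.2 cap are hypothetical, assuming the standard residual-RL setup where a bounded learned correction is added to the expert's action:

```python
import numpy as np

def residual_action(expert_action, residual, cap=0.2):
    """Toy residual-RL step: final action = expert action + clipped residual."""
    return expert_action + np.clip(residual, -cap, cap)

# If the expert's action is biased by 0.5 but the residual is capped at
# 0.2, the learned correction saturates at the cap: the best reachable
# action is still 0.3 away from the optimum at 0.0.
corrected = residual_action(expert_action=0.5, residual=-0.5, cap=0.2)
```

Under this setup, no amount of training closes the remaining gap, since the correction the policy would need lies outside the residual bound.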

Entities

Institutions

  • arXiv
