ARTFEED — Contemporary Art Intelligence

Benchmarking Expert-Guided RL Reveals Three Failure Modes

other · 2026-05-12

A recent study standardizes the evaluation of query-time expert-guided reinforcement learning methods on a shared SAC backbone, with unified hyperparameter optimization (HPO) and evaluation protocols. It runs 100/50 seeds per environment-method pair and a degradation sweep over expert undertuning, action bias, and observation noise. The results surface three failure modes that individual-paper evaluations miss: a critic blind spot under argmax-plus-bootstrap that leaves IBRL below no-expert SAC when the expert is near the no-expert-RL ceiling; residual saturation with suboptimal experts; and warm-start buffer poisoning that breaks training-time-handoff methods under deployment-time conditions. The full paper is available on arXiv.
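To make the critic-blind-spot mechanism concrete, here is a minimal sketch of an argmax-plus-bootstrap update in the style the summary describes. This is not the paper's implementation: `bootstrap_target` and `q_stale` are hypothetical names, and the setup assumes the common IBRL-style scheme where the agent picks between the RL policy's action and the expert's action using its own Q-estimates, then bootstraps from the chosen branch.

```python
import numpy as np

def bootstrap_target(q, next_state, policy_action, expert_action,
                     reward, gamma=0.99):
    """Pick between the two action proposals by current Q-estimates,
    then bootstrap from the chosen branch (argmax-plus-bootstrap)."""
    candidates = [policy_action, expert_action]
    q_values = [q(next_state, a) for a in candidates]
    chosen = int(np.argmax(q_values))   # argmax over the two proposals
    return reward + gamma * q_values[chosen], candidates[chosen]

# A stale critic that overrates the RL policy's action: the near-ceiling
# expert is never selected, so its Q-estimate is never corrected by real
# returns -- the blind spot.
q_stale = lambda s, a: 1.0 if a == "rl" else 0.5
target, picked = bootstrap_target(q_stale, None, "rl", "expert", reward=0.0)
```

Because only the chosen branch is ever evaluated against real returns, a critic that initially underrates a strong expert can keep ignoring it indefinitely, which is consistent with IBRL underperforming no-expert SAC on near-ceiling experts.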

Key facts

  • arXiv:2605.09109v1
  • Published on arXiv
  • Compares expert-guided RL methods on shared SAC backbone
  • Uses 100/50 seeds per (env, method)
  • Degradation sweep over expert undertuning, action bias, observation noise
  • Identifies three failure modes: critic blind spot, residual saturation, buffer poisoning
  • IBRL performs worse than no-expert SAC on near-ceiling experts
  • Training-time-handoff methods collapse under deployment-time conditions
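The residual-saturation failure listed above can be shown with a toy calculation. The function name and the 0.2 cap are hypothetical, assuming the standard residual-RL setup where a bounded learned correction is added to the expert's action:

```python
import numpy as np

def residual_action(expert_action, residual, cap=0.2):
    """Toy residual-RL step: final action = expert action + clipped residual."""
    return expert_action + np.clip(residual, -cap, cap)

# If the expert's action is biased by 0.5 but the residual is capped at
# 0.2, the learned correction saturates at the cap: the best reachable
# action is still 0.3 away from the optimum at 0.0.
corrected = residual_action(expert_action=0.5, residual=-0.5, cap=0.2)
```

Under this setup, no amount of training closes the remaining gap, since the correction the policy would need lies outside the residual bound.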

Entities

Institutions

  • arXiv
