Diagnostic-Driven Refinement Boosts LLM Reward Design for Sparse RL
A recent arXiv preprint (2605.28918) reframes LLM-generated reward shaping for sparse, structured reinforcement-learning tasks as a debugging process rather than a one-shot generation step. The authors study PPO-trained agents on MiniGrid (core tasks) and MuJoCo (boundary tasks) and identify two dominant failure modes, reward flooding and semantic/API misunderstanding, plus a rarer weak-shaping case. They propose diagnostic-driven iterative refinement, in which training diagnostics and a failure-mode taxonomy guide targeted revisions to the reward function. Refinement yields large gains: DoorKey-8x8 rises from 2.3% to 97.6% and KeyCorridor from 31.2% to 86.7%, although seed-to-seed variance is high. Control experiments indicate that the gains do not come from retraining or retrying alone: metrics-only re-prompting produces large drops, while a static-vocabulary control recovers much of the gap (87.6%; 70.7%). This points to the taxonomy prompt as a major mechanism, with dynamic labels providing additional benefit.
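The refinement loop described above can be sketched roughly as follows. This is an illustrative outline, not the preprint's implementation: the `Diagnostics` fields, the thresholds, the `classify_failure` heuristics, and the `train` and `revise` callables are all hypothetical names introduced here. The sketch assumes that reward flooding shows up as shaping reward far larger than the sparse task reward, that weak shaping shows up as a near-zero shaping signal, and that a semantic/API misunderstanding surfaces as an error raised by the reward code.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class Diagnostics:
    """Hypothetical summary of one training run (field names are assumptions)."""
    success_rate: float                  # fraction of evaluation episodes solved
    shaped_return: float                 # mean per-episode sum of shaping reward
    runtime_error: Optional[str] = None  # error text if the reward code failed


def classify_failure(d: Diagnostics, target: float = 0.9,
                     task_reward_scale: float = 1.0,
                     flood_ratio: float = 10.0,
                     weak_ratio: float = 0.1) -> str:
    """Map diagnostics to a failure-mode label using assumed thresholds."""
    if d.runtime_error is not None:
        return "semantic_api_misunderstanding"
    if d.success_rate >= target:
        return "ok"
    if d.shaped_return >= flood_ratio * task_reward_scale:
        return "reward_flooding"   # shaping swamps the sparse task reward
    if abs(d.shaped_return) <= weak_ratio * task_reward_scale:
        return "weak_shaping"      # shaping signal too small to guide learning
    return "unclassified"


def refine(reward_src: str,
           train: Callable[[str], Diagnostics],
           revise: Callable[[str, Diagnostics, str], str],
           max_rounds: int = 3) -> Tuple[str, List[Tuple[str, float]]]:
    """Train, diagnose, and request a targeted revision until success or budget."""
    history: List[Tuple[str, float]] = []
    for _ in range(max_rounds):
        d = train(reward_src)              # e.g. a PPO run using this reward code
        label = classify_failure(d)
        history.append((label, d.success_rate))
        if label == "ok":
            break
        # e.g. an LLM call whose prompt includes the diagnostics and the label
        reward_src = revise(reward_src, d, label)
    return reward_src, history
```

In this sketch, `train` would wrap a PPO run and `revise` would wrap the LLM prompt. Passing the failure-mode label to `revise`, not just the raw metrics, mirrors the summary's point that metrics-only re-prompting performed much worse.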
Key facts
- arXiv preprint 2605.28918
- LLM-generated reward shaping framed as debugging
- PPO-trained agents on MiniGrid and MuJoCo
- Dominant failure modes: reward flooding, semantic/API misunderstanding
- Rarer weak-shaping case identified
- Diagnostic-driven iterative refinement proposed
- DoorKey-8x8 improved from 2.3% to 97.6%
- KeyCorridor improved from 31.2% to 86.7%
- High seed-to-seed variance in results
- Metrics-only re-prompting yields large drops
- Static-vocabulary control recovers much of the gap (87.6%; 70.7%)
- Taxonomy prompt is a major mechanism behind the gains
- Dynamic labels provide additional benefit
Entities
Institutions
- arXiv