Model Exploitation in Reinforcement Learning: A New Definition
A new arXiv preprint (2605.15960) proposes a formal definition of model exploitation in reinforcement learning: a world model is exploited when it ranks one policy strictly above another while the true environment ranks them the other way. The authors draw an analogy to reward hacking but find that the inevitability proof for hacking does not transfer. They develop a general theory showing that exploitation is unavoidable on large policy sets, and that the conditions guaranteeing unhackability on finite policy sets do not preclude exploitation. They also introduce a relaxed notion of exploitation, together with a safe horizon within which it can be avoided.
Key facts
- arXiv paper 2605.15960 proposes a definition of model exploitation in reinforcement learning.
- Model exploitation occurs when a world model implies one policy is strictly preferred over another, but the true environment implies the reverse.
- The authors draw an analogy between this definition and reward hacking, but the inevitability proof for reward hacking does not transfer to model exploitation.
- The paper's general theory shows that exploitation is unavoidable on large policy sets.
- Conditions that guarantee unhackability in finite policy sets do not preclude exploitation.
- A relaxed notion of exploitation is introduced with a safe horizon within which it can be avoided.
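
The definition in the key facts above can be made concrete with a small sketch. It assumes each policy is summarized by a single scalar return under the world model and under the true environment, which is a simplification for illustration and not necessarily the paper's formalism. The function name `exploited_pairs`, the policy labels, and the return values are all hypothetical. A pair counts as exploited when the model strictly prefers one policy while the true environment strictly prefers the other.

```python
from itertools import combinations


def exploited_pairs(model_returns, true_returns):
    """List policy pairs that the model and the true environment rank in
    strictly opposite order.

    Both arguments map a policy name to a scalar return (an assumption of
    this sketch). Each result is ordered as
    (policy the model prefers, policy the true environment prefers).
    """
    pairs = []
    for a, b in combinations(sorted(model_returns), 2):
        model_gap = model_returns[a] - model_returns[b]
        true_gap = true_returns[a] - true_returns[b]
        if model_gap > 0 and true_gap < 0:
            pairs.append((a, b))
        elif model_gap < 0 and true_gap > 0:
            pairs.append((b, a))
        # Ties in either ranking are not strict reversals, so they are skipped.
    return pairs


# Hypothetical returns for three policies.
model_returns = {"pi_a": 1.0, "pi_b": 0.8, "pi_c": 0.2}
true_returns = {"pi_a": 0.5, "pi_b": 0.9, "pi_c": 0.1}

print(exploited_pairs(model_returns, true_returns))  # [('pi_a', 'pi_b')]
```

Here the model ranks pi_a above pi_b while the true environment ranks pi_b above pi_a, so optimizing against the model would select the policy that is worse in reality. The pairwise check grows quadratically with the number of policies, so it is only practical for small finite policy sets.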
Entities
Institutions
- arXiv