ARMS: Automatic Reward Shaping for Sparse-Reward Multi-Agent RL
Researchers propose ARMS, a self-supervised reward-shaping framework for multi-agent reinforcement learning (MARL) that addresses sparse rewards by learning dense shaping signals from trajectory rankings. Because single-agent policy-invariance guarantees do not transfer directly to MARL, the method reformulates policy invariance through conditional best-response reasoning: with opponent policies held fixed, the shaping rewards are proven to preserve each agent's best-response set and, under certain conditions, the set of Nash equilibria. Shaping therefore leaves the strategic structure of the game intact, whereas standard reward shaping may only improve short-term optimization. The work is presented in arXiv paper 2605.23562.
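To make the idea of learning dense signals from trajectory ranking concrete, here is a minimal sketch. It is not the ARMS implementation: the linear shaping model, the Bradley-Terry pairwise loss, the feature vectors, and every name in it (`fit_ranking_shaper`, `trajectory_score`, the toy episodes) are assumptions chosen for illustration. Episodes are ranked only by their sparse return, and a per-step score is fitted so that higher-ranked episodes accumulate a higher total.

```python
import numpy as np

def trajectory_score(w, traj):
    """Sum of per-step shaping signals w . phi(s_t) over one trajectory."""
    return float((traj @ w).sum())

def fit_ranking_shaper(trajs, returns, lr=0.1, epochs=200):
    """Fit w so that trajectories with higher sparse return score higher.

    Assumption: a Bradley-Terry pairwise logistic loss, a common choice for
    learning from rankings; the summary does not state which loss ARMS uses.
    """
    dim = trajs[0].shape[1]
    w = np.zeros(dim)
    # Every (better, worse) pair according to the sparse environment return.
    pairs = [(i, j)
             for i in range(len(trajs))
             for j in range(len(trajs))
             if returns[i] > returns[j]]
    for _ in range(epochs):
        grad = np.zeros(dim)
        for i, j in pairs:
            diff = trajs[i].sum(axis=0) - trajs[j].sum(axis=0)
            p = 1.0 / (1.0 + np.exp(-(diff @ w)))  # P(i ranked above j)
            grad += (1.0 - p) * diff               # gradient of log-likelihood
        w += lr * grad / max(len(pairs), 1)        # gradient ascent step
    return w

# Toy data (hypothetical): each row is a 2-d feature vector phi(s_t).
good = np.array([[1.0, 0.0], [1.0, 0.2]])    # episode that reached the goal
bad_a = np.array([[0.0, 1.0], [0.1, 1.0]])   # failed episode
bad_b = np.array([[0.0, 0.8], [0.0, 0.9]])   # failed episode
sparse_returns = [1.0, 0.0, 0.0]

w = fit_ranking_shaper([good, bad_a, bad_b], sparse_returns)
dense_signal = good @ w  # one shaping value per step of the good episode
```

The fitted `w` gives a value at every step, even though the environment only rewards the end of a successful episode. Whether such a signal is safe to add to the reward is a separate question, which is what the best-response analysis addresses.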
Key facts
- ARMS stands for Automatic Reward-shaping in Multi-agent Systems.
- It is a self-supervised framework for MARL.
- It learns dense shaping signals from sparse environmental rewards.
- Trajectory ranking is used to generate shaping signals.
- Single-agent guarantees do not directly transfer to MARL.
- The framework uses conditional best-response reasoning.
- Shaping rewards preserve each agent's best-response set under fixed opponent policies.
- The set of Nash equilibria is preserved under certain conditions.
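The best-response claim can be illustrated with a small numerical check. The sketch below uses classical potential-based shaping, which is an assumption here: the summary does not say which form of shaping or which conditions ARMS relies on. The two-agent game, the potential `phi`, and all function names are hypothetical. With the opponent's policy held fixed, the agent faces an ordinary MDP, and adding `gamma * phi(s') - phi(s)` to its reward shifts every optimal Q-value in a state by the same amount, so the set of optimal actions is unchanged.

```python
import numpy as np

def induce_mdp(P, R, opp_policy):
    """Fix the opponent's policy and marginalize its action out.

    P[s, a, b, t]: transition probabilities, R[s, a, b]: agent reward,
    opp_policy[s, b]: probability the opponent plays b in state s.
    """
    P_ind = np.einsum("sabt,sb->sat", P, opp_policy)
    R_ind = np.einsum("sab,sb->sa", R, opp_policy)
    return P_ind, R_ind

def q_iteration(P, R3, gamma, iters=2000):
    """Optimal Q-values for rewards R3[s, a, t] that may depend on next state."""
    Q = np.zeros(P.shape[:2])
    for _ in range(iters):
        V = Q.max(axis=1)
        Q = np.einsum("sat,sat->sa", P, R3 + gamma * V[None, None, :])
    return Q

def best_response_set(Q, tol=1e-8):
    """Boolean mask of the actions that are optimal in each state."""
    return Q >= Q.max(axis=1, keepdims=True) - tol

# Random toy game (hypothetical): 3 states, 2 actions per agent.
rng = np.random.default_rng(0)
S, A, B, gamma = 3, 2, 2, 0.9
P = rng.random((S, A, B, S))
P /= P.sum(axis=-1, keepdims=True)
R = rng.random((S, A, B))
opp_policy = rng.random((S, B))
opp_policy /= opp_policy.sum(axis=-1, keepdims=True)
phi = rng.normal(size=S)  # arbitrary potential over states

P_ind, R_ind = induce_mdp(P, R, opp_policy)
R_plain = np.broadcast_to(R_ind[:, :, None], P_ind.shape)
# Potential-based shaping term: gamma * phi(s') - phi(s).
R_shaped = R_plain + gamma * phi[None, None, :] - phi[:, None, None]

Q_plain = q_iteration(P_ind, R_plain, gamma)
Q_shaped = q_iteration(P_ind, R_shaped, gamma)

same_best_responses = np.array_equal(
    best_response_set(Q_plain), best_response_set(Q_shaped)
)
```

In this sketch `Q_shaped` equals `Q_plain - phi[:, None]` up to numerical error, which is why `same_best_responses` holds for any choice of `phi`. If the same invariance holds for every agent against every opponent profile, no agent gains or loses a best response, which is the intuition behind equilibrium preservation. The exact conditions ARMS requires are given in the paper, not in this summary.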