ARMS: Automatic Reward Shaping for Sparse-Reward Multi-Agent RL

ai-technology · 2026-05-25

Researchers propose ARMS, a self-supervised reward-shaping framework for multi-agent reinforcement learning (MARL) that addresses sparse rewards by learning dense shaping signals from trajectory ranking. The method reformulates policy invariance through conditional best-response reasoning and proves that, under certain conditions, the shaping rewards preserve each agent's best-response set against fixed opponent policies, as well as the set of Nash equilibria. The shaped problem therefore keeps the strategic structure of the original, whereas standard reward shaping may only improve short-term optimization. The work is presented in arXiv paper 2605.23562.
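
To make "learning dense shaping signals from trajectory ranking" concrete, here is a minimal sketch of one common way to do it: a linear per-step reward fitted with a pairwise (Bradley-Terry style) ranking loss, so that trajectories with a higher sparse return receive a higher summed learned reward. This is an illustrative assumption, not the ARMS training procedure, which the summary does not describe. The two step features ("progress" and "idle"), the toy trajectories, and all function names are hypothetical.

```python
import math

def feature_sums(traj, dim):
    """Sum each feature over the steps of one trajectory."""
    return [sum(step[k] for step in traj) for k in range(dim)]

def fit_ranking_reward(trajs, returns, dim, epochs=200, lr=0.1):
    """Fit weights w so that higher-return trajectories get a higher summed reward.

    Maximizes log sigmoid(score_i - score_j) over all pairs with returns[i] > returns[j],
    where score is the sum of w . step over the trajectory (an assumed model form).
    """
    w = [0.0] * dim
    sums = [feature_sums(t, dim) for t in trajs]
    pairs = [(i, j) for i in range(len(trajs)) for j in range(len(trajs))
             if returns[i] > returns[j]]
    for _ in range(epochs):
        for i, j in pairs:
            delta = [a - b for a, b in zip(sums[i], sums[j])]
            margin = sum(wk * dk for wk, dk in zip(w, delta))
            p_correct = 1.0 / (1.0 + math.exp(-margin))
            # Gradient ascent step on log sigmoid(margin).
            for k in range(dim):
                w[k] += lr * (1.0 - p_correct) * delta[k]
    return w

def dense_reward(w, step):
    """Learned per-step shaping signal for one step's feature vector."""
    return sum(wk * xk for wk, xk in zip(w, step))

# Hypothetical data: each step is [progress, idle]; the sparse return is 1 only on success.
trajs = [
    [[1, 0], [1, 0], [1, 1]],
    [[1, 1], [1, 0], [0, 0]],
    [[0, 1], [0, 1], [0, 0]],
    [[0, 0], [0, 1], [1, 1]],
]
returns = [1, 1, 0, 0]

w = fit_ranking_reward(trajs, returns, dim=2)
scores = [sum(dense_reward(w, step) for step in t) for t in trajs]
```

After fitting, `dense_reward` gives a signal at every step even though the environment only rewards the final outcome. A signal learned this way carries no equilibrium guarantee by itself; that guarantee is what the paper's best-response analysis is reported to supply.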

Key facts

ARMS stands for Automatic Reward-shaping in Multi-agent Systems.
It is a self-supervised framework for MARL.
It learns dense shaping signals from sparse environmental rewards.
Trajectory ranking is used to generate shaping signals.
Single-agent policy-invariance guarantees do not directly transfer to MARL.
The framework uses conditional best-response reasoning.
Shaping rewards preserve each agent's best-response set under fixed opponent policies.
The set of Nash equilibria is preserved under certain conditions.
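
The two preservation claims above can be illustrated on a toy one-shot game. The sketch below assumes a shaping term that depends only on the opponent's action. This is a textbook sufficient condition for leaving best responses unchanged, chosen here for illustration; it is not the shaping form or the conditions used by ARMS, which the summary does not specify. The payoff matrices, shaping values, and function names are all hypothetical.

```python
def best_responses(payoff, opp_mix, tol=1e-9):
    """Own actions that maximize expected payoff against a fixed opponent mixed strategy.

    payoff[own_action][opp_action] is the agent's payoff.
    """
    values = [sum(p * u for p, u in zip(opp_mix, row)) for row in payoff]
    best = max(values)
    return {a for a, v in enumerate(values) if v >= best - tol}

def pure_nash(p1, p2):
    """Pure-strategy Nash equilibria of a bimatrix game; both matrices are indexed [a1][a2]."""
    n1, n2 = len(p1), len(p1[0])
    equilibria = set()
    for a1 in range(n1):
        for a2 in range(n2):
            ok1 = all(p1[a1][a2] >= p1[b][a2] for b in range(n1))
            ok2 = all(p2[a1][a2] >= p2[a1][b] for b in range(n2))
            if ok1 and ok2:
                equilibria.add((a1, a2))
    return equilibria

# Sparse coordination game: both agents are rewarded only if both pick action 1.
p1 = [[0, 0], [0, 1]]
p2 = [[0, 0], [0, 1]]

# Assumed shaping terms: f1 depends only on agent 2's action, f2 only on agent 1's action.
f1 = [0.5, -0.2]
f2 = [0.3, 0.1]
shaped1 = [[p1[a1][a2] + f1[a2] for a2 in range(2)] for a1 in range(2)]
shaped2 = [[p2[a1][a2] + f2[a1] for a2 in range(2)] for a1 in range(2)]

# Fixed opponent strategies to test agent 1's best responses against.
mixes = [[1.0, 0.0], [0.5, 0.5], [0.0, 1.0]]
br_original = [best_responses(p1, m) for m in mixes]
br_shaped = [best_responses(shaped1, m) for m in mixes]

nash_original = pure_nash(p1, p2)
nash_shaped = pure_nash(shaped1, shaped2)
```

Because each shaping term adds the same amount to every one of an agent's own actions once the opponent's strategy is fixed, the ranking of that agent's actions is unchanged, so `br_shaped` matches `br_original` and `nash_shaped` matches `nash_original`.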

Entities

—

Sources

arXiv cs.AI — 2026-05-25