MAVIC: Correcting Value Estimates for Instruction-Following in Multi-Agent RL
Researchers have introduced Macro-Action Value Correction for Instruction Compliance (MAVIC), a method that addresses inconsistent value estimation in multi-agent reinforcement learning (MARL) when agents must follow external natural-language instructions that interrupt ongoing macro-actions. Standard Bellman updates couple value estimates across different instruction contexts, which corrupts the learned values. MAVIC corrects the Bellman backup at instruction boundaries by modifying the bootstrapping target rather than shaping the reward, keeping value estimation consistent under stochastic instruction switching within a single policy. The method is supported by theoretical analysis and an actor-critic implementation, and the authors report strong empirical performance. The paper is available on arXiv under ID 2605.12655.
Key facts
- MAVIC addresses value estimation inconsistencies in MARL with instruction following.
- Standard Bellman updates couple value estimates across instruction contexts.
- MAVIC corrects Bellman backups at instruction boundaries.
- It modifies the bootstrapping target rather than using reward shaping.
- The method enables consistent value estimation under stochastic instruction switching.
- MAVIC is supported by theoretical analysis and an actor-critic implementation.
- The paper is on arXiv with ID 2605.12655.
- The authors report strong empirical performance.
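The core idea above, correcting the bootstrapping target at instruction boundaries instead of shaping the reward, can be illustrated with a minimal sketch. The paper's exact formulation is not given here, so every name below (`td_target`, the per-context value arguments, the boundary flag) is an illustrative assumption, not the authors' implementation:

```python
def td_target(reward, next_value_same_ctx, next_value_new_ctx,
              instruction_switched, gamma=0.99):
    """Hypothetical boundary-corrected TD target (illustrative only).

    When an external instruction interrupts the current macro-action,
    a standard backup would bootstrap from the value estimated under
    the old instruction context, coupling values across contexts.
    This sketch instead bootstraps from the value under the NEW
    context at the boundary, so each context's estimate stays
    self-consistent.
    """
    # At an instruction boundary, switch the bootstrap source;
    # otherwise this reduces to the ordinary TD(0) target.
    bootstrap = next_value_new_ctx if instruction_switched else next_value_same_ctx
    return reward + gamma * bootstrap
```

In an actor-critic setup, a critic conditioned on the active instruction would supply `next_value_new_ctx` whenever the environment signals an instruction switch; away from boundaries the update is unchanged.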