Decentralized Q-Learning for Multi-Agent Workflow Handoffs

other · 2026-05-20

A recent arXiv preprint formalizes workflow learning in multi-agent systems in which specialized agents hand off control through a shared artifact while each agent sees only its local data. The study defines an interface-constrained semi-Markov decision process (IC-SMDP) whose decision points fall at handoffs. The researchers also propose IC-Q, an asynchronous decentralized Q-learning method that restricts inter-agent coordination to a single scalar at each handoff. They further derive a finite-sample bound for neural IC-Q that decomposes the error into three components: neural function approximation, an interface representation gap, and a mixing-time residual tied to the random option-duration discount. The work is relevant to multi-agent LLM pipelines that span trust or organizational boundaries, where no centralized learner can draw on joint trajectories.
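
The summary does not reproduce the preprint's IC-Q update rule, so the following is only a minimal tabular sketch of the general idea, assuming a standard SMDP-style Q-learning update in place of the neural version. The names `LocalAgent`, `value`, and `update`, the observation labels, and all numeric settings are hypothetical. Each agent keeps a Q-table over its own local observations, and the only quantity crossing the handoff is one scalar: the downstream agent's greedy value at its entry observation.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class LocalAgent:
    """Hypothetical agent with a Q-table over its own local observations only."""
    n_actions: int
    alpha: float = 0.5   # learning rate (assumed)
    gamma: float = 0.9   # per-step discount (assumed)
    q: dict = field(default_factory=lambda: defaultdict(float))

    def value(self, obs):
        """Scalar passed upstream at a handoff: greedy value of the entry observation."""
        return max(self.q[(obs, a)] for a in range(self.n_actions))

    def update(self, obs, action, reward, duration, downstream_value):
        """SMDP-style update: the bootstrap term is discounted by gamma**duration
        because the agent's turn (its option) lasts a variable number of steps."""
        target = reward + (self.gamma ** duration) * downstream_value
        key = (obs, action)
        self.q[key] += self.alpha * (target - self.q[key])
        return self.q[key]


# Two agents in a pipeline; neither reads the other's Q-table or observations.
upstream = LocalAgent(n_actions=2)
downstream = LocalAgent(n_actions=2)
downstream.q[("draft", 0)] = 1.0  # pretend the downstream agent has learned something

# Handoff: the downstream agent reports one scalar, the upstream agent updates locally.
scalar = downstream.value("draft")
new_q = upstream.update("task", 0, reward=0.5, duration=2, downstream_value=scalar)
print(round(new_q, 3))
```

In this sketch the update target is 0.5 + 0.9² × 1.0 = 1.31, so with a learning rate of 0.5 the upstream estimate moves from 0 to 0.655. The point of the design is that the upstream agent never needs the downstream agent's state, actions, or trajectory, which matches the setting described above where no centralized learner exists.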

Key facts

arXiv:2605.19140v1
Published on arXiv
Introduces IC-SMDP framework
Proposes IC-Q algorithm
Coordination limited to one scalar per handoff
Finite-sample bound for neural IC-Q
Three error sources identified
Targets multi-agent LLM pipelines
