Decentralized Q-Learning for Multi-Agent Workflow Handoffs
A recent arXiv preprint formalizes workflow learning in multi-agent systems where specialized agents hand off control through a shared artifact while each agent observes only its own local data. The study introduces an interface-constrained semi-Markov decision process (IC-SMDP) whose decision points fall at handoffs. The researchers also introduce IC-Q, an asynchronous decentralized Q-learning method that restricts inter-agent coordination to a single scalar at each handoff. They further derive a finite-sample bound for neural IC-Q that decomposes the error into three components: neural function approximation, an interface representation gap, and a mixing-time residual that depends on the discount induced by random option durations. The work is relevant to multi-agent LLM pipelines that span trust or organizational boundaries, where no centralized learner has access to joint trajectories.
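To make the single-scalar handoff concrete, the following is a minimal tabular sketch, not the paper's IC-Q algorithm. It assumes the scalar passed at a handoff is the receiving agent's greedy value estimate for the artifact it was given, and it applies the standard SMDP Q-learning update with a discount of gamma raised to the option's duration. The class `HandoffAgent`, its method names, and the example observations and options are all hypothetical.

```python
from collections import defaultdict
import random


class HandoffAgent:
    """Tabular learner that sees only its own local observations (hypothetical)."""

    def __init__(self, options, alpha=0.1, gamma=0.95, epsilon=0.1, seed=0):
        self.options = list(options)
        self.alpha = alpha        # learning rate
        self.gamma = gamma        # per-step discount factor
        self.epsilon = epsilon    # exploration probability
        self.q = defaultdict(float)  # (local_obs, option) -> value estimate
        self.rng = random.Random(seed)

    def value(self, local_obs):
        """Assumed handoff message: one scalar, max over options of Q."""
        return max(self.q[(local_obs, o)] for o in self.options)

    def choose(self, local_obs):
        """Epsilon-greedy option selection from local information only."""
        if self.rng.random() < self.epsilon:
            return self.rng.choice(self.options)
        return max(self.options, key=lambda o: self.q[(local_obs, o)])

    def update(self, local_obs, option, reward, duration, downstream_scalar):
        """SMDP Q-learning step; discount gamma**duration reflects option length."""
        target = reward + (self.gamma ** duration) * downstream_scalar
        key = (local_obs, option)
        self.q[key] += self.alpha * (target - self.q[key])
        return self.q[key]


# Two agents that never share tables or trajectories, only one number.
upstream = HandoffAgent(options=["draft", "escalate"])
downstream = HandoffAgent(options=["review"])
downstream.q[("artifact_v1", "review")] = 2.0  # pretend prior learning

# At the handoff, the downstream agent reports a single scalar.
scalar = downstream.value("artifact_v1")

# The upstream option ran for 3 steps and earned reward 1.0 before handing off.
new_q = upstream.update("ticket", "draft", reward=1.0, duration=3,
                        downstream_scalar=scalar)
```

In this sketch each agent keeps its own table and updates whenever its option ends, so no joint trajectory is ever assembled; the only cross-agent traffic is `scalar`. How the paper defines the scalar, handles neural approximation, or schedules asynchronous updates may differ.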
Key facts
- arXiv:2605.19140v1
- Published on arXiv
- Introduces IC-SMDP framework
- Proposes IC-Q algorithm
- Coordination limited to one scalar per handoff
- Finite-sample bound for neural IC-Q
- Three error sources identified: neural function approximation, interface representation gap, mixing-time residual
- Targets multi-agent LLM pipelines
Entities
Institutions
- arXiv