Multi-Agent RL for LLM Workflows: When Does It Help?
An arXiv preprint (2605.24202) examines when end-to-end reinforcement learning (RL) training improves multi-agent LLM workflows over their base models. The study contrasts Shared-Policy training, where a single policy is updated across all roles, with Isolated-Policy training, where each role keeps its own parameters. The experiments span three workflows (Eval-Opt, Voting, and Orch-Workers), two task domains (math and code), and three model scales (0.6B, 1.7B, and 4B parameters). The findings indicate that multi-agent RL generally outperforms the base models, but that the size of the improvement depends on the interaction of workflow, task, and scale rather than on policy sharing alone. Isolated-Policy training often reaches higher peak accuracy but is more prone to sharp accuracy declines, whereas Shared-Policy training redistributes failures into qualitatively different patterns.
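The difference between the two training regimes can be illustrated with a minimal sketch. It is not the preprint's implementation: the `Policy` class, the role names, the toy weight vectors, and the plain gradient step are all assumptions made for illustration. The only point it shows is parameter sharing: under Shared-Policy every role refers to the same parameters, so an update triggered by one role changes all of them, while under Isolated-Policy each role has its own copy.

```python
from dataclasses import dataclass
from typing import Dict, List

# Hypothetical role names, loosely modeled on an evaluator/optimizer workflow.
ROLES = ["evaluator", "optimizer"]


@dataclass
class Policy:
    """Toy stand-in for an LLM policy: just a list of weights."""
    weights: List[float]

    def update(self, grad: List[float], lr: float = 0.1) -> None:
        # Plain gradient step; a real RL trainer would be far more involved.
        self.weights = [w - lr * g for w, g in zip(self.weights, grad)]


def build_shared(roles: List[str], init: List[float]) -> Dict[str, Policy]:
    # Shared-Policy: every role maps to the SAME Policy object.
    policy = Policy(list(init))
    return {role: policy for role in roles}


def build_isolated(roles: List[str], init: List[float]) -> Dict[str, Policy]:
    # Isolated-Policy: every role gets its own independent copy.
    return {role: Policy(list(init)) for role in roles}


shared = build_shared(ROLES, [0.0, 0.0])
isolated = build_isolated(ROLES, [0.0, 0.0])

# Apply one update that originates from the evaluator role in each setup.
shared["evaluator"].update([1.0, -1.0])
isolated["evaluator"].update([1.0, -1.0])

print("shared optimizer weights:  ", shared["optimizer"].weights)
print("isolated optimizer weights:", isolated["optimizer"].weights)
```

In the shared setup the optimizer's weights move even though only the evaluator was updated; in the isolated setup they stay at their initial values. The same coupling applies to real model parameters, where a training signal from one role also changes the behavior of every other role that uses the same policy.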
Key facts
- arXiv preprint 2605.24202 studies multi-agent RL for LLM workflows.
- Compares Shared-Policy and Isolated-Policy training.
- Experiments use Eval-Opt, Voting, and Orch-Workers workflows.
- Tasks include math and code.
- Model scales: 0.6B, 1.7B, 4B parameters.
- Multi-agent RL usually improves over base models.
- Gains depend on workflow, task, and scale.
- Isolated-Policy often reaches higher peak accuracy but is more prone to accuracy cliffs (sharp declines).
- Shared-Policy redistributes failures into qualitatively different patterns.
Entities
Institutions
- arXiv