Multi-Agent RL for LLM Workflows: When Does It Help?

ai-technology · 2026-05-26

A recent arXiv preprint (2605.24202) examines when end-to-end reinforcement learning (RL) training improves multi-agent LLM workflows over their base models. It contrasts Shared-Policy training, where a single policy is updated across all roles, with Isolated-Policy training, where each role keeps its own parameters. The experiments cover three workflows (Eval-Opt, Voting, and Orch-Workers), two task domains (math and code), and three model scales (0.6B, 1.7B, and 4B parameters). Multi-agent RL usually outperforms the base models, but the size of the gain depends on the interaction of workflow, task, and scale rather than on policy sharing alone. Isolated-Policy training often reaches higher peak accuracy but is more prone to sharp accuracy drops, whereas Shared-Policy training redistributes failures into qualitatively different patterns.
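The difference between the two training regimes comes down to how roles map to parameters. The toy sketch below illustrates only that mapping and is not the preprint's implementation: the `Policy` class, the single scalar `weight`, the role names, and the gradient values are all hypothetical stand-ins for a real LLM and a real RL update.

```python
from dataclasses import dataclass


@dataclass
class Policy:
    """Hypothetical stand-in for a model's trainable parameters."""
    weight: float = 0.0   # one scalar instead of billions of parameters
    updates: int = 0      # how many role-level updates touched this policy

    def apply_gradient(self, grad: float, lr: float = 0.1) -> None:
        # Plain gradient-ascent step; a real system would use an RL objective.
        self.weight += lr * grad
        self.updates += 1


def build_policies(roles: list[str], shared: bool) -> dict[str, Policy]:
    """Map each role to a policy object.

    shared=True  -> every role points at the same Policy (Shared-Policy).
    shared=False -> every role gets its own Policy (Isolated-Policy).
    """
    if shared:
        policy = Policy()
        return {role: policy for role in roles}
    return {role: Policy() for role in roles}


def train_step(policies: dict[str, Policy], role_grads: dict[str, float]) -> None:
    """Apply each role's gradient to whichever policy that role uses."""
    for role, grad in role_grads.items():
        policies[role].apply_gradient(grad)


# Assumed role names for illustration; the preprint's roles may differ.
ROLES = ["evaluator", "optimizer"]
```

With `shared=True`, gradients from both roles accumulate in the same object, so one role's update changes the other's behaviour. With `shared=False`, each role's parameters move independently.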

Key facts

arXiv preprint 2605.24202 studies multi-agent RL for LLM workflows.
Compares Shared-Policy and Isolated-Policy training.
Experiments use Eval-Opt, Voting, and Orch-Workers workflows.
Tasks include math and code.
Model scales: 0.6B, 1.7B, 4B parameters.
Multi-agent RL usually improves over base models.
Gains depend on workflow, task, and scale.
Isolated-Policy often reaches higher peak accuracy but is more prone to accuracy cliffs.
Shared-Policy redistributes failures into qualitatively different patterns.
