SAGA: A Distributed Scheduler for AI Agent Inference on GPU Clusters
A new paper on arXiv (2605.00528) introduces SAGA, a distributed scheduler designed to optimize AI agent inference on GPU clusters. Current GPU schedulers treat each LLM call as independent, discarding intermediate state and inflating latency by 3-8x. SAGA shifts to program-level scheduling, treating the entire agent workflow as a schedulable unit. It combines three mechanisms: Agent Execution Graphs that predict KV cache reuse across tool-call boundaries, session-affinity batching with work stealing, and Agent Fair Share, a task-completion-time fairness metric. On cache management, the system comes within 1.31x of Bélády's optimal offline policy.
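The program-level view can be illustrated with a small sketch. This is not the paper's implementation: the `AgentStep` structure, the reuse score, and the example workflow are all hypothetical, meant only to show why a scheduler that sees the whole agent graph can predict which KV cache entries are worth keeping across tool-call boundaries.

```python
from dataclasses import dataclass, field

@dataclass
class AgentStep:
    """One LLM call in an agent workflow (illustrative, not SAGA's data model)."""
    step_id: str
    prompt_tokens: int
    children: list = field(default_factory=list)  # later calls that extend this context

def count_descendants(step: AgentStep) -> int:
    """Number of downstream LLM calls reachable from this step."""
    return sum(1 + count_descendants(c) for c in step.children)

def cache_reuse_score(step: AgentStep) -> int:
    """Crude reuse estimate: tokens of KV cache times the number of
    descendant calls that could reuse them if the cache stays resident."""
    return step.prompt_tokens * count_descendants(step)

# A tool-calling workflow: plan -> two tool calls -> summarize.
plan = AgentStep("plan", prompt_tokens=1200)
search = AgentStep("search", prompt_tokens=300)
calc = AgentStep("calc", prompt_tokens=250)
summarize = AgentStep("summarize", prompt_tokens=400)
plan.children = [search, calc]
search.children = [summarize]

# A call-level scheduler would evict plan's 1200-token prefix at each tool-call
# boundary; a graph-aware one sees three downstream consumers and keeps it warm.
print(cache_reuse_score(plan))  # 1200 tokens * 3 descendants = 3600
```

A real scheduler would weigh such a score against GPU memory pressure and predicted tool-call latency; the point here is only that the workflow graph makes the reuse opportunity visible before the calls arrive.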
Key facts
- Paper on arXiv: 2605.00528
- Title: SAGA: Workflow-Atomic Scheduling for AI Agent Inference on GPU Clusters
- Current GPU schedulers treat each LLM call as independent, causing 3-8x latency inflation
- SAGA proposes program-level scheduling for entire agent workflows
- Uses Agent Execution Graphs to predict KV cache reuse across tool-call boundaries
- Achieves within 1.31x of Bélády's optimal offline policy
- Implements session-affinity batching with work stealing
- Introduces Agent Fair Share, a task-completion-time fairness metric
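Session-affinity batching with work stealing, listed above, can be sketched as a small queueing policy. The class below is a minimal illustration under assumed semantics (the paper's actual batching and stealing rules are not specified here): sessions hash to a fixed worker so their KV cache stays warm on one GPU, and an idle worker steals from the most backlogged queue rather than sit idle.

```python
import hashlib
from collections import deque

class AffinityPool:
    """Sketch of session-affinity batching with work stealing.
    All names and the stealing policy are illustrative, not SAGA's."""

    def __init__(self, n_workers: int):
        self.queues = [deque() for _ in range(n_workers)]

    def submit(self, session_id: str, request) -> None:
        # Affinity: the same session always hashes to the same worker,
        # so its cached context is reused instead of being rebuilt elsewhere.
        idx = int(hashlib.sha256(session_id.encode()).hexdigest(), 16) % len(self.queues)
        self.queues[idx].append((session_id, request))

    def next_for(self, worker: int):
        # Serve the worker's own queue first (cache-friendly path).
        if self.queues[worker]:
            return self.queues[worker].popleft()
        # Otherwise steal from the longest queue: accept one cache miss
        # rather than leave a GPU idle while another worker is backlogged.
        victim = max(range(len(self.queues)), key=lambda i: len(self.queues[i]))
        if self.queues[victim]:
            return self.queues[victim].popleft()
        return None
```

The trade-off this illustrates is the one the paper's design targets: strict affinity maximizes cache reuse but can strand work behind a hot session, and stealing caps that imbalance at the price of occasional cold-cache execution.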
Entities
Institutions
- arXiv