SAGA: A Distributed Scheduler for AI Agent Inference on GPU Clusters
A new paper on arXiv (2605.00528) introduces SAGA, a distributed scheduler designed to optimize AI agent inference on GPU clusters. Current GPU schedulers treat each LLM call as independent, discarding intermediate state and inflating latency by 3-8x. SAGA shifts to program-level scheduling, treating the entire agent workflow as a schedulable unit. It combines three mechanisms: Agent Execution Graphs that predict KV cache reuse across tool-call boundaries, session-affinity batching with work stealing, and Agent Fair Share, a task-completion-time fairness metric. On cache management, the system comes within 1.31x of Bélády's optimal offline policy.
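The program-level view can be illustrated with a small sketch. This is not the paper's implementation: the `AgentStep` structure, the reuse score, and the example workflow are all hypothetical, meant only to show why a scheduler that sees the whole agent graph can predict which KV cache entries are worth keeping across tool-call boundaries.

```python
from dataclasses import dataclass, field

@dataclass
class AgentStep:
    """One LLM call in an agent workflow (illustrative, not SAGA's data model)."""
    step_id: str
    prompt_tokens: int
    children: list = field(default_factory=list)  # later calls that extend this context

def count_descendants(step: AgentStep) -> int:
    """Number of downstream LLM calls reachable from this step."""
    return sum(1 + count_descendants(c) for c in step.children)

def cache_reuse_score(step: AgentStep) -> int:
    """Crude reuse estimate: tokens of KV cache times the number of
    descendant calls that could reuse them if the cache stays resident."""
    return step.prompt_tokens * count_descendants(step)

# A tool-calling workflow: plan -> two tool calls -> summarize.
plan = AgentStep("plan", prompt_tokens=1200)
search = AgentStep("search", prompt_tokens=300)
calc = AgentStep("calc", prompt_tokens=250)
summarize = AgentStep("summarize", prompt_tokens=400)
plan.children = [search, calc]
search.children = [summarize]

# A call-level scheduler would evict plan's 1200-token prefix at each tool-call
# boundary; a graph-aware one sees three downstream consumers and keeps it warm.
print(cache_reuse_score(plan))  # 1200 tokens * 3 descendants = 3600
```

A real scheduler would weigh such a score against GPU memory pressure and predicted tool-call latency; the point here is only that the workflow graph makes the reuse opportunity visible before the calls arrive.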
Key facts
- Paper on arXiv: 2605.00528
- Title: SAGA: Workflow-Atomic Scheduling for AI Agent Inference on GPU Clusters
- Current GPU schedulers treat each LLM call as independent, causing 3-8x latency inflation
- SAGA proposes program-level scheduling for entire agent workflows
- Uses Agent Execution Graphs to predict KV cache reuse across tool-call boundaries
- Achieves within 1.31x of Bélády's optimal offline policy
- Implements session-affinity batching with work stealing
- Introduces Agent Fair Share, a task-completion-time fairness metric
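Session-affinity batching with work stealing, listed above, can be sketched as a small queueing policy. The class below is a minimal illustration under assumed semantics (the paper's actual batching and stealing rules are not specified here): sessions hash to a fixed worker so their KV cache stays warm on one GPU, and an idle worker steals from the most backlogged queue rather than sit idle.

```python
import hashlib
from collections import deque

class AffinityPool:
    """Sketch of session-affinity batching with work stealing.
    All names and the stealing policy are illustrative, not SAGA's."""

    def __init__(self, n_workers: int):
        self.queues = [deque() for _ in range(n_workers)]

    def submit(self, session_id: str, request) -> None:
        # Affinity: the same session always hashes to the same worker,
        # so its cached context is reused instead of being rebuilt elsewhere.
        idx = int(hashlib.sha256(session_id.encode()).hexdigest(), 16) % len(self.queues)
        self.queues[idx].append((session_id, request))

    def next_for(self, worker: int):
        # Serve the worker's own queue first (cache-friendly path).
        if self.queues[worker]:
            return self.queues[worker].popleft()
        # Otherwise steal from the longest queue: accept one cache miss
        # rather than leave a GPU idle while another worker is backlogged.
        victim = max(range(len(self.queues)), key=lambda i: len(self.queues[i]))
        if self.queues[victim]:
            return self.queues[victim].popleft()
        return None
```

The trade-off this illustrates is the one the paper's design targets: strict affinity maximizes cache reuse but can strand work behind a hot session, and stealing caps that imbalance at the price of occasional cold-cache execution.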
Entities
Institutions
- arXiv