ARTFEED — Contemporary Art Intelligence

SAGA: A Distributed Scheduler for AI Agent Inference on GPU Clusters

ai-technology · 2026-05-04

A new paper on arXiv (2605.00528) introduces SAGA, a distributed scheduler for AI agent inference on GPU clusters. Current GPU schedulers treat each LLM call as independent, discarding intermediate state between calls and inflating latency by 3-8x. SAGA shifts to program-level scheduling, treating the entire agent workflow as the schedulable unit. It combines three mechanisms: Agent Execution Graphs, which predict KV cache reuse across tool-call boundaries; session-affinity batching with work stealing; and Agent Fair Share, a task-completion-time fairness metric. The resulting cache management comes within 1.31x of Bélády's optimal offline policy.
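
To make the program-level idea concrete, the sketch below models an agent workflow as a small DAG of LLM calls whose shared prompt prefixes stand in for predictable KV cache reuse. This is one illustrative reading of the paper's Agent Execution Graphs, not its actual API; every name here (LLMCall, AgentExecutionGraph, predicted_reuse, ready) is an assumption.

    from dataclasses import dataclass, field
    from typing import Dict, List, Set

    @dataclass
    class LLMCall:
        call_id: str
        prompt_prefix: str  # prompt text whose KV cache could be reused
        depends_on: List[str] = field(default_factory=list)  # tool-call ordering edges

    @dataclass
    class AgentExecutionGraph:
        workflow_id: str
        calls: Dict[str, LLMCall] = field(default_factory=dict)

        def add_call(self, call: LLMCall) -> None:
            self.calls[call.call_id] = call

        def predicted_reuse(self, done: str, upcoming: str) -> int:
            # Crude reuse predictor: length of the shared prompt prefix
            # between a finished call and an upcoming one.
            a = self.calls[done].prompt_prefix
            b = self.calls[upcoming].prompt_prefix
            n = 0
            for x, y in zip(a, b):
                if x != y:
                    break
                n += 1
            return n

        def ready(self, finished: Set[str]) -> List[str]:
            # Calls whose dependencies are all done. A program-level
            # scheduler keeps the whole graph together, so this reuse
            # signal survives across tool-call boundaries instead of
            # being discarded between independent requests.
            return [c.call_id for c in self.calls.values()
                    if c.call_id not in finished
                    and all(d in finished for d in c.depends_on)]

A scheduler holding this graph can route the next call to the GPU that already has the warm cache for the shared prefix, rather than rescheduling each call from scratch.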

Key facts

  • Paper on arXiv: 2605.00528
  • Title: SAGA: Workflow-Atomic Scheduling for AI Agent Inference on GPU Clusters
  • Current GPU schedulers treat each LLM call as independent, causing 3-8x latency inflation
  • SAGA proposes program-level scheduling for entire agent workflows
  • Uses Agent Execution Graphs to predict KV cache reuse across tool-call boundaries
  • Cache management comes within 1.31x of Bélády's optimal offline eviction policy
  • Implements session-affinity batching with work stealing (see the sketch after this list)
  • Introduces Agent Fair Share, a task-completion-time fairness metric (a toy formulation follows below)
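
The work-stealing bullet above is sketched below under simple assumptions: each agent session sticks to one GPU queue so its KV cache stays warm, and an idle GPU steals a whole session from the longest queue, migrating the affinity along with it. The class and method names are hypothetical, not taken from the paper.

    import collections

    class AffinityScheduler:
        # Toy session-affinity batching with work stealing; illustrative only.
        def __init__(self, n_gpus: int):
            self.queues = [collections.deque() for _ in range(n_gpus)]
            self.affinity = {}  # session_id -> preferred GPU index

        def submit(self, session_id: str, request) -> None:
            # Keep a session on one GPU so its KV cache stays resident.
            gpu = self.affinity.setdefault(session_id,
                                           hash(session_id) % len(self.queues))
            self.queues[gpu].append((session_id, request))

        def next_batch(self, gpu: int, max_batch: int = 8):
            q = self.queues[gpu]
            if not q:
                # Idle GPU: steal from the longest queue and move the
                # session's affinity along with the stolen work.
                victim = max(range(len(self.queues)),
                             key=lambda i: len(self.queues[i]))
                if self.queues[victim]:
                    session_id, request = self.queues[victim].popleft()
                    self.affinity[session_id] = gpu
                    q.append((session_id, request))
            return [q.popleft() for _ in range(min(max_batch, len(q)))]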

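The summary does not spell out the Agent Fair Share formula, but a task-completion-time fairness metric plausibly compares each agent's completion time under sharing against its completion time running alone. The toy version below reports the worst slowdown; the paper's actual definition may differ.

    def agent_fair_share(completion: dict, isolated: dict):
        # Toy fairness score: per-agent slowdown is shared completion time
        # over isolated completion time; the worst slowdown summarizes
        # unfairness (lower is fairer). Hypothetical formulation.
        slowdowns = {a: completion[a] / isolated[a] for a in completion}
        return max(slowdowns.values()), slowdowns

    worst, per_agent = agent_fair_share(
        completion={"agent_a": 12.0, "agent_b": 30.0},
        isolated={"agent_a": 10.0, "agent_b": 10.0},
    )
    # worst == 3.0: agent_b runs 3x slower than it would alone
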
Entities

Institutions

  • arXiv
