Fluid-Guided Scheduling Optimizes LLM Inference with Memory Constraints

ai-technology · 2026-05-18

A new paper on arXiv (2504.11320) introduces WAIT and Nested WAIT, threshold-based admission rules for scheduling LLM inference. The work formulates inference as a multi-stage online scheduling problem with linear iteration times and GPU-resident KV-cache constraints. It targets endogenous memory growth: Key-Value (KV) caches grow as requests decode, which can force the eviction of in-progress requests and waste computation. The stakes are high, with providers facing costs exceeding $700,000 per day to serve millions of users. A fluid model characterizes the equilibrium batch composition, memory requirement, and stability region. WAIT handles known output lengths, while Nested WAIT extends the approach to unknown lengths by regulating how requests advance across decode stages.
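To make the idea of a threshold-based admission rule concrete, here is a minimal Python sketch. It is not the paper's algorithm. The class name, the request types, the per-type thresholds, and the arrival sequence are all hypothetical, and the rule shown (hold arriving requests in per-type queues and release a batch only once every queue reaches its threshold) is a simplified illustration of the known-output-length setting.

```python
from collections import deque


class ThresholdAdmission:
    """Toy threshold-based admission rule (hypothetical, for illustration only)."""

    def __init__(self, thresholds):
        # thresholds: {request_type: number of queued requests needed before release}
        self.thresholds = dict(thresholds)
        self.queues = {t: deque() for t in self.thresholds}

    def arrive(self, request_type, request_id):
        # New requests wait in the queue for their type.
        self.queues[request_type].append(request_id)

    def try_admit(self):
        # Release a batch only when every type has reached its threshold.
        if any(len(self.queues[t]) < n for t, n in self.thresholds.items()):
            return []
        batch = []
        for t, n in self.thresholds.items():
            for _ in range(n):
                batch.append((t, self.queues[t].popleft()))
        return batch


# Assumed example: "short" and "long" stand in for output-length classes.
sched = ThresholdAdmission({"short": 2, "long": 1})
sched.arrive("short", "r1")
sched.arrive("long", "r2")
first = sched.try_admit()    # "short" is still below its threshold of 2
sched.arrive("short", "r3")
second = sched.try_admit()   # both thresholds are now met
```

In this sketch the first call admits nothing and the second releases all three requests together. Waiting for a full batch trades a little latency for a batch composition that is known in advance.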

Key facts

arXiv paper 2504.11320 introduces WAIT and Nested WAIT scheduling rules
LLM providers incur costs exceeding $700,000 per day
Endogenous memory growth in KV cache can evict in-progress requests
Fluid model characterizes equilibrium batch composition and stability region
WAIT is a threshold-based admission rule for known output lengths
Nested WAIT extends to unknown output lengths
Inference formulated as multi-stage online scheduling problem
GPU-resident KV-cache constraints are central to the model
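
The memory-growth and linear-iteration-time facts above can be illustrated with a small Python sketch. The function names, the token-based memory unit, and the coefficients are assumptions for illustration, not values from the paper. Each request is assumed to hold one KV entry per prompt token plus one per generated token, all requests in the batch are assumed to start decoding together, and iteration time is assumed to be linear in the KV entries resident in the batch.

```python
def kv_tokens(prompt_len, generated):
    # KV-cache entries held by one request: prompt plus tokens generated so far.
    return prompt_len + generated


def batch_memory(batch, step):
    # batch: list of (prompt_len, output_len) pairs admitted at the same time.
    # A request frees its cache once it has generated output_len tokens.
    return sum(kv_tokens(p, step) for p, o in batch if step <= o)


def peak_memory(batch):
    # Largest footprint over the lifetime of the batch.
    horizon = max(o for _, o in batch)
    return max(batch_memory(batch, s) for s in range(horizon + 1))


def iteration_time(batch, step, base=1.0, per_token=0.001):
    # Assumed linear model: fixed overhead plus a cost per resident KV entry.
    return base + per_token * batch_memory(batch, step)


batch = [(100, 50), (200, 10)]          # hypothetical prompt/output lengths
at_admission = batch_memory(batch, 0)   # footprint when the batch is admitted
peak = peak_memory(batch)               # footprint at its worst decode step
```

Here the batch occupies 300 entries at admission but peaks at 320 ten steps later, so a cache limited to 310 entries would overflow mid-decode even though the batch fit when it was admitted. That is the sense in which the memory growth is endogenous to the scheduling decision.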
