ARTFEED — Contemporary Art Intelligence

Fluid-Guided Scheduling Optimizes LLM Inference with Memory Constraints

ai-technology · 2026-05-18

A new arXiv paper (2504.11320) introduces WAIT and Nested WAIT, two threshold-based admission rules for scheduling LLM inference. The work formulates inference as a multi-stage online scheduling problem with linear iteration times and GPU-resident KV-cache constraints. It targets endogenous memory growth: the Key-Value (KV) cache of each request grows as decoding proceeds, so the scheduler can run out of memory and evict in-progress requests, wasting the computation already spent on them. The stakes are financial as well as technical, since providers serving millions of users face costs exceeding $700,000 per day. A fluid model characterizes the equilibrium batch composition, memory requirement, and stability region. WAIT handles known output lengths, while Nested WAIT extends to unknown lengths by regulating how requests advance across decode stages.
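To make the idea of a threshold-based admission rule concrete, here is a minimal Python sketch. It is not the paper's WAIT algorithm, whose thresholds and request classes are not described in this summary. It only illustrates the general pattern of holding arrivals in a queue and releasing a batch once enough of them have accumulated. The class name ThresholdAdmission, its methods, and the single fixed threshold are all assumptions made for illustration.

```python
from __future__ import annotations

from collections import deque


class ThresholdAdmission:
    """Hypothetical gate: queue arrivals, release a batch once enough are waiting."""

    def __init__(self, threshold: int) -> None:
        if threshold < 1:
            raise ValueError("threshold must be at least 1")
        self.threshold = threshold
        self.waiting: deque[str] = deque()

    def arrive(self, request_id: str) -> None:
        # New requests always join the back of the waiting queue.
        self.waiting.append(request_id)

    def try_admit(self) -> list[str]:
        # Admit nothing until the queue has reached the threshold.
        if len(self.waiting) < self.threshold:
            return []
        # Release exactly `threshold` requests in arrival order.
        return [self.waiting.popleft() for _ in range(self.threshold)]


gate = ThresholdAdmission(threshold=3)
gate.arrive("a")
gate.arrive("b")
print(gate.try_admit())  # [] because only two requests are waiting
gate.arrive("c")
print(gate.try_admit())  # ['a', 'b', 'c']
```

A single threshold is the simplest possible choice. A rule that distinguishes requests by output length or decode stage would need one queue and one threshold per class, which this sketch does not attempt.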

Key facts

  • arXiv paper 2504.11320 introduces WAIT and Nested WAIT scheduling rules
  • LLM providers serving millions of users face costs exceeding $700,000 per day
  • Endogenous memory growth in the KV cache can evict in-progress requests
  • Fluid model characterizes equilibrium batch composition and stability region
  • WAIT is a threshold-based admission rule for known output lengths
  • Nested WAIT extends to unknown output lengths
  • Inference is formulated as a multi-stage online scheduling problem
  • GPU-resident KV-cache constraints are central to the model
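
The memory-growth point can be illustrated with a small calculation. The Python sketch below is an illustration under simplifying assumptions, not the paper's model: every request in a batch starts decoding at the same time, each prompt or generated token occupies one KV-cache slot, and a request frees its slots as soon as it finishes. The function names and the (prompt_len, output_len) tuple layout are hypothetical.

```python
def kv_usage_by_step(requests):
    """Return the KV-cache tokens held after each decode step.

    requests: list of (prompt_len, output_len) tuples, all admitted together.
    """
    horizon = max((out for _, out in requests), default=0)
    usage = []
    for step in range(1, horizon + 1):
        # A request still decoding at this step holds its prompt plus
        # every token generated so far; finished requests hold nothing.
        usage.append(sum(prompt + step for prompt, out in requests if out >= step))
    return usage


def fits_in_memory(requests, capacity_tokens):
    """True if the batch never exceeds capacity at any decode step."""
    return max(kv_usage_by_step(requests), default=0) <= capacity_tokens


batch = [(4, 2), (3, 3)]  # (prompt_len, output_len), values chosen for illustration
print(kv_usage_by_step(batch))    # [9, 11, 6]
print(fits_in_memory(batch, 10))  # False: step 2 needs 11 tokens
print(fits_in_memory(batch, 11))  # True
```

With a capacity of 10 tokens, this batch fits after the first decode step (9 tokens) but overflows at the second (11 tokens). That is the situation in which an in-progress request would be evicted and its computation wasted, and it is why checking memory only at admission time is not enough.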

Entities

Institutions

  • arXiv

Sources