LLM Serving Needs Mathematical Optimization, Not Heuristics
A recent position paper on arXiv argues that LLM inference serving has outgrown standard heuristics and now requires mathematical optimization and principled algorithm design. Despite significant engineering advances in serving systems such as vLLM and SGLang, these systems still rely on classical distributed-computing policies: round-robin and join-shortest-queue for request routing, FIFO for scheduling, and LRU for cache eviction. These general-purpose policies ignore what makes LLM inference distinctive: dynamically growing KV-cache memory, the asymmetry between the prefill and decode phases, output lengths that are unknown in advance, and continuous batching. The authors call for mathematical models that capture these features faithfully, so that algorithms can be designed with provable performance guarantees across diverse workloads, rather than relying on heuristics that work in some regimes and fail unpredictably in others.
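To make the contrast concrete, here is a minimal sketch of how a classical join-shortest-queue router differs from a KV-cache-aware variant. Everything in it is an illustrative assumption, not the paper's algorithm or vLLM/SGLang internals: the `Replica` fields, the `route_kv_aware` feasibility rule, and the block estimate are hypothetical.

```python
from dataclasses import dataclass

# Illustrative only: these names and the feasibility rule are assumptions
# for exposition, not the paper's algorithm or any system's actual API.

@dataclass
class Replica:
    name: str
    queued: int          # requests waiting -- all that JSQ looks at
    kv_free_blocks: int  # free KV-cache blocks on this replica

def route_jsq(replicas):
    """Classical join-shortest-queue: pick the replica with the fewest
    waiting requests, ignoring KV-cache memory pressure entirely."""
    return min(replicas, key=lambda r: r.queued)

def route_kv_aware(replicas, est_blocks):
    """Hypothetical KV-aware variant: first drop replicas that cannot hold
    the request's estimated KV footprint, then pick the shortest queue."""
    feasible = [r for r in replicas if r.kv_free_blocks >= est_blocks]
    return min(feasible or replicas, key=lambda r: r.queued)

replicas = [Replica("a", queued=1, kv_free_blocks=8),
            Replica("b", queued=3, kv_free_blocks=512)]
print(route_jsq(replicas).name)           # "a": shortest queue wins
print(route_kv_aware(replicas, 64).name)  # "b": "a" lacks KV headroom
```

Note that even this variant presupposes a usable estimate of the request's KV footprint, which unknown output lengths make nontrivial; that gap is part of what the paper says a proper mathematical model should capture.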
Key facts
- Paper argues LLM serving needs mathematical optimization, not just heuristics.
- Current systems like vLLM and SGLang use classical distributed computing policies.
- Policies include join-shortest-queue, round-robin, FIFO, and LRU (see the cache-eviction sketch after this list).
- LLM inference has unique characteristics: dynamic KV cache, prefill-decode asymmetry, unknown output lengths, continuous batching.
- Paper calls for mathematical models with provable performance guarantees.
- Heuristics may succeed in some scenarios but fail unpredictably.
- Published on arXiv with ID 2605.01280.
- The paper notes emerging work at the intersection of optimization and LLM serving.
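As a toy illustration of the cache-eviction bullet, the sketch below contrasts textbook LRU with a reference-count-aware eviction loop that refuses to evict KV blocks still shared by in-flight requests (for example, a common prompt prefix). The class name, the `ref_count` field, and the eviction rule are assumptions for exposition, not the actual policy of vLLM, SGLang, or the paper.

```python
from collections import OrderedDict

class KVBlockCache:
    """Toy KV-block cache. Textbook LRU would always evict the least
    recently used block; this sketch skips blocks whose ref_count shows
    they are still pinned by in-flight requests (e.g. a shared prefix)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.blocks = OrderedDict()  # block_id -> ref_count, in LRU order

    def access(self, block_id, ref_count=0):
        self.blocks.pop(block_id, None)
        self.blocks[block_id] = ref_count  # (re)insert at MRU position
        self._evict()

    def _evict(self):
        # Scan from the LRU end; unlike plain LRU, never evict a block
        # that live requests still reference.
        for bid in list(self.blocks):
            if len(self.blocks) <= self.capacity:
                break
            if self.blocks[bid] == 0:
                del self.blocks[bid]
```

A real system would also need a policy for the case where every resident block is pinned; here the cache simply overshoots its capacity, which is one more place where a heuristic's behavior becomes hard to predict.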
Entities
Institutions
- arXiv