Multi-Model LLM Scheduling Under GPU Memory Constraints

ai-technology · 2026-05-20

A new arXiv preprint (2605.19593) presents an empirical study of scheduling multiple large language models (LLMs) on shared heterogeneous hardware. The research examines the performance impact of partial CPU-GPU offloading and preemption, and finds that offloading causes non-linear, model-dependent degradation in decode throughput, with smaller models more sensitive to reduced GPU residency. The study notes the lack of existing work on multi-model scheduling under memory constraints and offers insights for future scheduler design.
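To give some intuition for why degradation from offloading can be non-linear, here is a minimal sketch of a bandwidth-bound toy model. It is not the preprint's methodology or data. It assumes that decoding one token reads every weight once, that GPU-resident weights are read at `gpu_bw` and offloaded weights at a much lower `offload_bw`, and that the two phases do not overlap. The function name, parameters, and bandwidth figures are all hypothetical.

```python
def decode_tokens_per_second(model_bytes, gpu_fraction, gpu_bw, offload_bw):
    """Toy estimate of decode throughput under partial offloading.

    Assumption (not from the preprint): each token requires one full pass
    over the weights, and the per-token time is the sum of the time to read
    the GPU-resident part and the time to read the offloaded part.
    """
    if not 0.0 <= gpu_fraction <= 1.0:
        raise ValueError("gpu_fraction must be between 0 and 1")
    gpu_time = model_bytes * gpu_fraction / gpu_bw
    offload_time = model_bytes * (1.0 - gpu_fraction) / offload_bw
    return 1.0 / (gpu_time + offload_time)


if __name__ == "__main__":
    # Hypothetical numbers: 10 GB of weights, 1000 GB/s GPU memory,
    # 50 GB/s effective bandwidth for the offloaded portion.
    for fraction in (1.0, 0.75, 0.5, 0.25, 0.0):
        rate = decode_tokens_per_second(10e9, fraction, 1000e9, 50e9)
        print(f"GPU residency {fraction:.2f}: {rate:.1f} tokens/s")
```

Under these assumed numbers, throughput falls from about 100 tokens/s at full residency to under 10 tokens/s at half residency, because the slow path dominates the per-token time as soon as a modest share of the weights leaves the GPU. The toy model does not capture the model-dependent effects the study reports, such as the greater sensitivity of smaller models.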

Key facts

arXiv preprint 2605.19593
Study on multi-model LLM scheduling
Focus on offloading and preemption
Non-linear decode throughput degradation
Smaller models more sensitive to offloading
Shared heterogeneous hardware context
GPU memory constraints
Empirical study across hardware platforms
