Multi-Model LLM Scheduling Under GPU Memory Constraints
A new arXiv preprint (2605.19593) presents an empirical study of scheduling multiple large language models (LLMs) on shared heterogeneous hardware. The study measures the performance impact of partial CPU-GPU offloading and preemption, and finds that offloading degrades decode throughput in a non-linear, model-dependent way, with smaller models more sensitive to reduced GPU residency. It also highlights how little existing work addresses multi-model scheduling under memory constraints, and it offers insights for future scheduler design.
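To give a rough intuition for why throughput can fall non-linearly as GPU residency shrinks, here is a minimal sketch of a bandwidth-bound toy model. It is not the preprint's methodology or data. It assumes every decoded token reads each weight once, that GPU-resident weights are read at GPU memory bandwidth, and that offloaded weights are read over a slower CPU-GPU link. The function names and the bandwidth figures are hypothetical, chosen only for illustration. This simple model does not reproduce the model-dependent sensitivity the study reports.

```python
def decode_tokens_per_second(weight_gb, gpu_fraction, gpu_bw_gbps, offload_bw_gbps):
    """Toy estimate of decode throughput under partial offloading.

    Assumption (not from the preprint): per-token time is the time to read
    the GPU-resident weights at gpu_bw_gbps plus the time to read the
    offloaded weights at offload_bw_gbps. Compute and overlap are ignored.
    """
    if not 0.0 <= gpu_fraction <= 1.0:
        raise ValueError("gpu_fraction must be in [0, 1]")
    gpu_time = weight_gb * gpu_fraction / gpu_bw_gbps
    offload_time = weight_gb * (1.0 - gpu_fraction) / offload_bw_gbps
    return 1.0 / (gpu_time + offload_time)


def residency_sweep(weight_gb, gpu_bw_gbps, offload_bw_gbps, fractions):
    """Return (gpu_fraction, tokens_per_second) pairs for each fraction."""
    return [
        (f, decode_tokens_per_second(weight_gb, f, gpu_bw_gbps, offload_bw_gbps))
        for f in fractions
    ]


if __name__ == "__main__":
    # Hypothetical numbers: 10 GB of weights, 1000 GB/s GPU memory,
    # 25 GB/s CPU-GPU link.
    for fraction, tps in residency_sweep(10, 1000, 25, [1.0, 0.9, 0.5, 0.0]):
        print(f"GPU residency {fraction:.0%}: {tps:.1f} tokens/s")
```

Under these assumed numbers, moving only a small share of the weights off the GPU removes most of the throughput, because the slow link dominates per-token time. A scheduler that treats residency as a linear knob would therefore overestimate what a partially offloaded model can deliver.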
Key facts
- arXiv preprint 2605.19593
- Study on multi-model LLM scheduling
- Focus on offloading and preemption
- Non-linear decode throughput degradation
- Smaller models more sensitive to offloading
- Shared heterogeneous hardware context
- GPU memory constraints
- Empirical study across hardware platforms
Entities
Institutions
- arXiv