ARTFEED — Contemporary Art Intelligence

Multi-Model LLM Scheduling Under GPU Memory Constraints

ai-technology · 2026-05-20

A new arXiv preprint (2605.19593) presents an empirical study of scheduling multiple Large Language Models (LLMs) on shared heterogeneous hardware. The research examines the performance impact of partial CPU-GPU offloading and preemption, and finds that offloading causes non-linear, model-dependent degradation in decode throughput, with smaller models more sensitive to reduced GPU residency. The study notes the lack of prior work on multi-model scheduling under memory constraints and offers insights for future scheduler design.

Key facts

  • arXiv preprint 2605.19593
  • Study on multi-model LLM scheduling
  • Focus on offloading and preemption
  • Non-linear decode throughput degradation
  • Smaller models more sensitive to offloading
  • Shared heterogeneous hardware context
  • GPU memory constraints
  • Empirical study across hardware platforms
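
One way to build intuition for why partial offloading can degrade decode throughput non-linearly is a toy bandwidth model. The sketch below is not taken from the preprint. It assumes decoding is memory-bound, that each generated token reads every weight once, and that offloaded weights are read at a much lower effective bandwidth than GPU-resident ones. The function name decode_tokens_per_sec, the model size, and both bandwidth figures are hypothetical values chosen for illustration. The model also does not capture the model-dependent effects the study reports.

```python
def decode_tokens_per_sec(model_bytes, gpu_fraction, gpu_bw, offload_bw):
    """Toy estimate of decode throughput under partial offloading.

    Assumption (not from the preprint): per-token latency is the time to
    read the GPU-resident weights at gpu_bw plus the time to read the
    offloaded weights at offload_bw, both in bytes per second.
    """
    if not 0.0 <= gpu_fraction <= 1.0:
        raise ValueError("gpu_fraction must be between 0 and 1")
    gpu_time = model_bytes * gpu_fraction / gpu_bw
    offload_time = model_bytes * (1.0 - gpu_fraction) / offload_bw
    return 1.0 / (gpu_time + offload_time)


if __name__ == "__main__":
    # Hypothetical figures: a 14 GB model, 1000 GB/s GPU memory bandwidth,
    # 50 GB/s effective bandwidth for the offloaded portion.
    MODEL_BYTES = 14e9
    GPU_BW = 1000e9
    OFFLOAD_BW = 50e9

    for frac in (1.0, 0.9, 0.75, 0.5, 0.0):
        tps = decode_tokens_per_sec(MODEL_BYTES, frac, GPU_BW, OFFLOAD_BW)
        print(f"GPU-resident fraction {frac:.2f}: {tps:.1f} tokens/s")
```

Under these assumed bandwidths, moving 10% of the weights off the GPU costs far more than 10% of throughput, because the slow portion dominates the per-token latency. A scheduler that treated GPU residency as a linear knob would therefore misjudge the cost of offloading.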

Entities

Institutions

  • arXiv
