ARTFEED — Contemporary Art Intelligence

HFX System Optimizes LLM Serving with Multi-SLO Support and Fast Scaling

ai-technology · 2026-04-27

HFX is a production system that jointly optimizes request scheduling and elastic scaling for large language model (LLM) serving. It targets the challenge of satisfying diverse, user-specific service-level objectives (SLOs) at low compute cost under dynamic, multi-task workloads. Its scheduler proactively estimates each request's remaining latency budget and prioritizes requests accordingly, maintaining SLO compliance for both new and ongoing tasks, while its scaler performs fast device-to-device (D2D) weight transfers to cut cold-start latency when capacity is added. HFX supports both colocated and disaggregated deployment models. The work is detailed in a paper on arXiv (2508.15919), which argues that existing approaches rely on static scheduling or single-task assumptions.
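The summary does not spell out HFX's actual scheduling algorithm, but the idea of "proactive budget estimation and prioritization" can be illustrated with a least-slack-first sketch: estimate how much of each request's SLO budget remains after the work still to be done, and serve the request closest to violating its deadline first. All names, the per-token latency model, and the least-slack policy here are illustrative assumptions, not HFX's design.

```python
import heapq
import time
from dataclasses import dataclass, field

@dataclass(order=True)
class Request:
    # Heap orders by slack only; the other fields are payload.
    slack: float                                  # budget left after estimated work (s)
    req_id: str = field(compare=False)
    slo_deadline: float = field(compare=False)    # absolute deadline (epoch seconds)
    tokens_left: int = field(compare=False)       # decode tokens still to generate

def estimate_budget(deadline, tokens_left, per_token_s, now=None):
    """Remaining slack = time until the SLO deadline minus the estimated
    time to finish decoding (a simple linear per-token model, assumed)."""
    now = time.time() if now is None else now
    return (deadline - now) - tokens_left * per_token_s

def prioritize(requests, per_token_s, now):
    """Least-slack-first: pop the request most at risk of an SLO miss."""
    heap = []
    for req_id, deadline, tokens_left in requests:
        slack = estimate_budget(deadline, tokens_left, per_token_s, now)
        heapq.heappush(heap, Request(slack, req_id, deadline, tokens_left))
    return [heapq.heappop(heap).req_id for _ in range(len(heap))]
```

For example, a request with a tight deadline but little work left may still outrank one with a far deadline and a long generation ahead, which is the behavior a budget-aware (rather than FIFO) scheduler is after.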

Key facts

  • HFX jointly optimizes request scheduling and elastic scaling for LLM serving.
  • It addresses strict user-specific SLOs under dynamic, multi-task workloads.
  • The scheduler performs proactive budget estimation and prioritization.
  • The scaler supports fast D2D weight transfer to reduce cold-start latency.
  • HFX supports colocated and disaggregated deployment architectures.
  • Existing approaches rely on static scheduling or single-task settings.
  • The system is designed for production use with heterogeneous requests.
  • The paper is available on arXiv with ID 2508.15919.
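The cold-start benefit of D2D transfer comes down to link bandwidth: copying weights from a peer device over a fast interconnect is orders of magnitude quicker than reloading them from storage. The model size and bandwidth figures below are assumed round numbers for illustration, not measurements from the paper.

```python
def transfer_seconds(model_bytes: float, bandwidth_bytes_per_s: float) -> float:
    """Time to move model weights over a link at the given bandwidth."""
    return model_bytes / bandwidth_bytes_per_s

# Assumed: a 7B-parameter model in fp16 is roughly 14 GB of weights.
MODEL_BYTES = 14e9

# Assumed bandwidths: ~3 GB/s for an NVMe read, ~300 GB/s for a
# peer-to-peer GPU interconnect such as NVLink.
disk_s = transfer_seconds(MODEL_BYTES, 3e9)    # storage cold start
d2d_s  = transfer_seconds(MODEL_BYTES, 300e9)  # device-to-device warm copy
```

Under these assumptions the storage path takes a few seconds while the D2D path takes tens of milliseconds, which is why fast D2D transfer matters for scaling new serving instances under load.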

Entities

Institutions

  • arXiv

Sources