HFX System Optimizes LLM Serving with Multi-SLO Support and Fast Scaling
The HFX production system jointly optimizes request scheduling and elastic scaling for large language model (LLM) serving. It tackles the challenge of meeting diverse, user-specific service-level objectives (SLOs) while keeping compute costs low in dynamic, multi-task environments. HFX's scheduler proactively estimates SLO budgets and prioritizes requests so that both new and ongoing tasks stay compliant, while its scaler performs fast device-to-device (D2D) weight transfers to cut cold-start latency. HFX supports both colocated and disaggregated deployment models. The work, detailed in arXiv paper 2508.15919, argues that existing approaches fall short because they rely on static scheduling or assume single-task settings.
Key facts
- HFX jointly optimizes request scheduling and elastic scaling for LLM serving.
- It addresses strict user-specific SLOs under dynamic, multi-task workloads.
- The scheduler performs proactive budget estimation and prioritization.
- The scaler supports fast D2D weight transfer to reduce cold-start latency.
- HFX supports colocated and disaggregated deployment architectures.
- Existing approaches rely on static scheduling or single-task settings.
- The system is designed for production use with heterogeneous requests.
- The paper is available on arXiv with ID 2508.15919.
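The cold-start benefit of D2D weight transfer comes down to bandwidth: copying weights over a GPU interconnect from a warm replica is typically orders of magnitude faster than reloading them from storage. The sketch below is a back-of-the-envelope model of that decision; the function names, parameters, and example numbers are assumptions for illustration, not figures from the paper.

```python
def load_seconds(model_bytes: float, bandwidth_bps: float, overhead_s: float = 0.0) -> float:
    """Time to materialize weights at a given bandwidth plus fixed setup overhead."""
    return model_bytes / bandwidth_bps + overhead_s

def choose_weight_source(model_bytes: float,
                         disk_bw: float, disk_overhead_s: float,
                         d2d_bw: float, d2d_overhead_s: float) -> str:
    """Return 'd2d' when copying from a warm replica beats a cold load from storage."""
    cold_load = load_seconds(model_bytes, disk_bw, disk_overhead_s)
    d2d_copy = load_seconds(model_bytes, d2d_bw, d2d_overhead_s)
    return "d2d" if d2d_copy < cold_load else "disk"
```

Under these illustrative numbers, a 14 GB model read from storage at 2 GB/s with 5 s of initialization takes about 12 s, while a D2D copy at 200 GB/s with 0.5 s of setup takes well under a second, so the scaler would prefer the warm replica.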