HFX System Optimizes LLM Serving with Multi-SLO Support and Fast Scaling
The HFX production system jointly optimizes request scheduling and elastic scaling for large language model (LLM) serving. It tackles the challenge of meeting diverse, user-specific service-level objectives (SLOs) while keeping compute costs low in dynamic, multi-task environments. HFX's scheduler proactively estimates SLO budgets and prioritizes requests so that both new and ongoing tasks stay compliant, while its scaler performs fast device-to-device (D2D) weight transfers to cut cold-start latency. HFX supports both colocated and disaggregated deployment models. The work, detailed in arXiv paper 2508.15919, argues that existing approaches fall short because they rely on static scheduling or assume single-task settings.
Key facts
- HFX jointly optimizes request scheduling and elastic scaling for LLM serving.
- It addresses strict user-specific SLOs under dynamic, multi-task workloads.
- The scheduler performs proactive budget estimation and prioritization.
- The scaler supports fast D2D weight transfer to reduce cold-start latency.
- HFX supports colocated and disaggregated deployment architectures.
- Existing approaches rely on static scheduling or single-task settings.
- The system is designed for production use with heterogeneous requests.
- The paper is available on arXiv with ID 2508.15919.
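The cold-start benefit of D2D weight transfer comes down to bandwidth: copying weights over a GPU interconnect from a warm replica is typically orders of magnitude faster than reloading them from storage. The sketch below is a back-of-the-envelope model of that decision; the function names, parameters, and example numbers are assumptions for illustration, not figures from the paper.

```python
def load_seconds(model_bytes: float, bandwidth_bps: float, overhead_s: float = 0.0) -> float:
    """Time to materialize weights at a given bandwidth plus fixed setup overhead."""
    return model_bytes / bandwidth_bps + overhead_s

def choose_weight_source(model_bytes: float,
                         disk_bw: float, disk_overhead_s: float,
                         d2d_bw: float, d2d_overhead_s: float) -> str:
    """Return 'd2d' when copying from a warm replica beats a cold load from storage."""
    cold_load = load_seconds(model_bytes, disk_bw, disk_overhead_s)
    d2d_copy = load_seconds(model_bytes, d2d_bw, d2d_overhead_s)
    return "d2d" if d2d_copy < cold_load else "disk"
```

Under these illustrative numbers, a 14 GB model read from storage at 2 GB/s with 5 s of initialization takes about 12 s, while a D2D copy at 200 GB/s with 0.5 s of setup takes well under a second, so the scaler would prefer the warm replica.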