Coral: Cost-Efficient Multi-LLM Serving on Heterogeneous Cloud GPUs
Coral is an adaptive, heterogeneity-aware multi-LLM serving system designed to harness diverse cloud GPUs efficiently for concurrent model serving. It jointly optimizes resource allocation and serving strategies across multiple models, using a lossless two-stage decomposition to reduce online solve time from hours to tens of seconds. Evaluated on 6 models and 20 GPU configurations, Coral reduces serving cost by up to 2.79× relative to the best baseline.
Key facts
- Coral is a multi-LLM serving system for heterogeneous cloud GPUs.
- It jointly optimizes resource allocation and serving strategy across all models.
- Uses lossless two-stage decomposition to cut online solve time from hours to tens of seconds.
- Evaluated on 6 models and 20 GPU configurations.
- Reduces serving cost by up to 2.79× over the best baseline.
- Addresses fragmented LLM usage and diverse cloud GPU availability.
- Targets mid-tier and older-generation GPUs with better availability.
- Preserves joint optimality while reducing computational overhead.
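The summary does not detail Coral's decomposition, but the general shape of a two-stage split can be illustrated: stage 1 precomputes, for each (model, GPU type) pair, the cheapest serving strategy in isolation; stage 2 then assigns models to GPU types using only those precomputed costs, shrinking the online search space. The sketch below is a toy illustration under that assumption; all model names, prices, throughputs, and the greedy stage-2 solver are hypothetical, not Coral's actual algorithm (a real system would also enforce GPU availability constraints, e.g. via an ILP).

```python
# Toy two-stage decomposition sketch. All data is illustrative,
# not taken from the Coral paper.
GPU_PRICE = {"A10G": 1.0, "V100": 2.0}  # hypothetical $/GPU-hour
THROUGHPUT = {  # (model, gpu, tensor-parallel degree) -> requests/s, hypothetical
    ("llama-7b", "A10G", 1): 10.0,
    ("llama-7b", "A10G", 2): 18.0,
    ("llama-7b", "V100", 1): 16.0,
    ("llama-13b", "A10G", 2): 8.0,
    ("llama-13b", "V100", 1): 9.0,
    ("llama-13b", "V100", 2): 16.0,
}

def stage1_best_strategy(model, gpu):
    """Stage 1: for one (model, GPU type) pair, pick the serving strategy
    (here, just a parallelism degree) with the lowest $ per unit throughput."""
    best = None
    for (m, g, tp), rps in THROUGHPUT.items():
        if m == model and g == gpu:
            cost_per_rps = GPU_PRICE[g] * tp / rps
            if best is None or cost_per_rps < best[1]:
                best = (tp, cost_per_rps)
    return best  # (degree, $/hr per request/s) or None if infeasible

def stage2_allocate(models, demand):
    """Stage 2: with per-pair costs fixed by stage 1, assign each model to
    the GPU type that serves its demand (requests/s) most cheaply."""
    plan, total = {}, 0.0
    for m in models:
        options = []
        for g in GPU_PRICE:
            best = stage1_best_strategy(m, g)
            if best is not None:
                tp, unit_cost = best
                options.append((unit_cost * demand[m], g, tp))
        cost, g, tp = min(options)  # greedy; a real solver would be an ILP
        plan[m] = (g, tp)
        total += cost
    return plan, total
```

The point of the decomposition is that stage 1 runs once per (model, GPU type) pair and can be done offline or in parallel, so the online stage-2 problem only sees a small table of costs rather than the full joint strategy space.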