Coral: Cost-Efficient Multi-LLM Serving on Heterogeneous Cloud GPUs
Coral is an adaptive, heterogeneity-aware multi-LLM serving system designed to harness diverse cloud GPUs efficiently for concurrent model serving. It jointly optimizes resource allocation and serving strategies across multiple models, using a lossless two-stage decomposition to reduce online solve time from hours to tens of seconds. Evaluated on 6 models and 20 GPU configurations, Coral reduces serving cost by up to 2.79× relative to the best baseline.
Key facts
- Coral is a multi-LLM serving system for heterogeneous cloud GPUs.
- It jointly optimizes resource allocation and serving strategy across all models.
- Uses lossless two-stage decomposition to cut online solve time from hours to tens of seconds.
- Evaluated on 6 models and 20 GPU configurations.
- Reduces serving cost by up to 2.79× over the best baseline.
- Addresses fragmented LLM usage and diverse cloud GPU availability.
- Targets mid-tier and older-generation GPUs with better availability.
- Preserves joint optimality while reducing computational overhead.
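The summary does not detail Coral's decomposition, but the general shape of a two-stage split can be illustrated: stage 1 precomputes, for each (model, GPU type) pair, the cheapest serving strategy in isolation; stage 2 then assigns models to GPU types using only those precomputed costs, shrinking the online search space. The sketch below is a toy illustration under that assumption; all model names, prices, throughputs, and the greedy stage-2 solver are hypothetical, not Coral's actual algorithm (a real system would also enforce GPU availability constraints, e.g. via an ILP).

```python
# Toy two-stage decomposition sketch. All data is illustrative,
# not taken from the Coral paper.
GPU_PRICE = {"A10G": 1.0, "V100": 2.0}  # hypothetical $/GPU-hour
THROUGHPUT = {  # (model, gpu, tensor-parallel degree) -> requests/s, hypothetical
    ("llama-7b", "A10G", 1): 10.0,
    ("llama-7b", "A10G", 2): 18.0,
    ("llama-7b", "V100", 1): 16.0,
    ("llama-13b", "A10G", 2): 8.0,
    ("llama-13b", "V100", 1): 9.0,
    ("llama-13b", "V100", 2): 16.0,
}

def stage1_best_strategy(model, gpu):
    """Stage 1: for one (model, GPU type) pair, pick the serving strategy
    (here, just a parallelism degree) with the lowest $ per unit throughput."""
    best = None
    for (m, g, tp), rps in THROUGHPUT.items():
        if m == model and g == gpu:
            cost_per_rps = GPU_PRICE[g] * tp / rps
            if best is None or cost_per_rps < best[1]:
                best = (tp, cost_per_rps)
    return best  # (degree, $/hr per request/s) or None if infeasible

def stage2_allocate(models, demand):
    """Stage 2: with per-pair costs fixed by stage 1, assign each model to
    the GPU type that serves its demand (requests/s) most cheaply."""
    plan, total = {}, 0.0
    for m in models:
        options = []
        for g in GPU_PRICE:
            best = stage1_best_strategy(m, g)
            if best is not None:
                tp, unit_cost = best
                options.append((unit_cost * demand[m], g, tp))
        cost, g, tp = min(options)  # greedy; a real solver would be an ILP
        plan[m] = (g, tp)
        total += cost
    return plan, total
```

The point of the decomposition is that stage 1 runs once per (model, GPU type) pair and can be done offline or in parallel, so the online stage-2 problem only sees a small table of costs rather than the full joint strategy space.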