ARTFEED — Contemporary Art Intelligence

Coral: Cost-Efficient Multi-LLM Serving on Heterogeneous Cloud GPUs

ai-technology · 2026-05-07

Coral is an adaptive, heterogeneity-aware multi-LLM serving system designed to harness diverse cloud GPUs efficiently for concurrent model serving. It jointly optimizes resource allocation and serving strategies across multiple models, using a lossless two-stage decomposition to cut online solve time from hours to tens of seconds. Evaluated on 6 models and 20 GPU configurations, Coral achieves up to a 2.79× cost reduction over the best baseline.

Key facts

  • Coral is a multi-LLM serving system for heterogeneous cloud GPUs.
  • It jointly optimizes resource allocation and serving strategy across all models.
  • Uses lossless two-stage decomposition to cut online solve time from hours to tens of seconds.
  • Evaluated on 6 models and 20 GPU configurations.
  • Reduces serving cost by up to 2.79× over the best baseline.
  • Addresses fragmented LLM usage and diverse cloud GPU availability.
  • Targets mid-tier and older-generation GPUs with better availability.
  • Preserves joint optimality while reducing computational overhead.
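To make the two-stage idea concrete, here is a minimal sketch in Python. It is not Coral's actual algorithm; all names, prices, and throughput numbers are illustrative assumptions. Stage 1 solves each model's subproblem offline (cheapest GPU type and replica count meeting that model's demand); stage 2 combines the per-model optima into a fleet plan online. In this simplified setting, where GPU supply is unconstrained and models do not share GPUs, summing per-model optima loses no optimality, which is the spirit of a lossless decomposition.

```python
# Illustrative sketch of a two-stage decomposition for multi-LLM placement.
# All GPU types, prices ($/GPU-hour), and per-GPU throughputs (tokens/s)
# below are made-up values, not measurements from the Coral paper.
GPU_HOURLY_COST = {"A10G": 1.0, "L4": 0.7, "V100": 0.9}
GPU_THROUGHPUT = {"A10G": 220, "L4": 150, "V100": 180}

def stage1_best_config(demand_tps):
    """Stage 1 (per model, offline): cheapest (gpu_type, replicas)
    whose aggregate throughput covers the model's demand."""
    best = None
    for gpu in GPU_HOURLY_COST:
        replicas = -(-demand_tps // GPU_THROUGHPUT[gpu])  # ceiling division
        cost = replicas * GPU_HOURLY_COST[gpu]
        if best is None or cost < best[2]:
            best = (gpu, replicas, cost)
    return best

def stage2_allocate(model_demands):
    """Stage 2 (online): assemble the fleet plan from per-model optima.
    With an unconstrained GPU pool, the sum of per-model minima equals
    the joint minimum, so the decomposition is lossless here."""
    plan = {m: stage1_best_config(d) for m, d in model_demands.items()}
    total_cost = sum(cost for (_, _, cost) in plan.values())
    return plan, total_cost

plan, total = stage2_allocate({"llama-7b": 400, "mistral-7b": 300})
print(plan, total)
```

The online step now only combines small precomputed tables instead of searching the full cross-product of models, GPU types, and replica counts, which is one way a decomposition can shrink solve time by orders of magnitude.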
