PALS: Power-Aware LLM Serving for Mixture-of-Experts Models

ai-technology · 2026-05-22

Researchers have introduced PALS (Power-Aware LLM Serving), a runtime that treats GPU power caps as a tunable resource for improving the energy efficiency of large language model inference. Implemented within the vLLM framework, PALS combines lightweight offline power-performance models with a feedback-driven controller to select configurations that meet throughput targets while reducing energy use. The system requires no model retraining or API changes. It was evaluated on multi-GPU systems with both dense and mixture-of-experts (MoE) models, where it is reported to deliver significant energy-efficiency gains.
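To make the selection step concrete, here is a minimal Python sketch of how an offline power-performance table could be used to pick a configuration. It is not the PALS implementation: the Config type, the select_config function, and every number in PROFILE are hypothetical placeholders, not measurements from the paper. The sketch assumes the objective is the lowest energy per token among configurations whose profiled throughput meets the target.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class Config:
    """One profiled operating point (all fields are illustrative assumptions)."""
    power_cap_w: int       # per-GPU power cap in watts
    batch_size: int        # maximum batch size given to the serving engine
    throughput_tps: float  # profiled throughput in tokens per second
    power_w: float         # profiled average power draw in watts

    @property
    def energy_per_token_j(self) -> float:
        # watts / (tokens per second) = joules per token
        return self.power_w / self.throughput_tps


def select_config(profile: Iterable[Config], target_tps: float) -> Optional[Config]:
    """Return the feasible config with the lowest energy per token, or None."""
    feasible = [c for c in profile if c.throughput_tps >= target_tps]
    if not feasible:
        return None
    return min(feasible, key=lambda c: c.energy_per_token_j)


# Placeholder values for illustration only; these are NOT measured results.
PROFILE = [
    Config(power_cap_w=300, batch_size=32, throughput_tps=2400.0, power_w=290.0),
    Config(power_cap_w=250, batch_size=32, throughput_tps=2250.0, power_w=245.0),
    Config(power_cap_w=200, batch_size=32, throughput_tps=1900.0, power_w=198.0),
    Config(power_cap_w=250, batch_size=64, throughput_tps=3000.0, power_w=248.0),
    Config(power_cap_w=200, batch_size=64, throughput_tps=2600.0, power_w=199.0),
]

best = select_config(PROFILE, target_tps=2000.0)
print(best)
```

With these placeholder values, a target of 2000 tokens per second selects the 200 W cap with batch size 64, because it is the feasible point with the lowest joules per token. A real table would come from profiling the specific model and hardware.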

Key facts

PALS is a power-aware runtime for LLM serving.
It treats GPU power caps as a first-class control knob.
It jointly optimizes power caps with software parameters like batch size.
The system uses lightweight offline power-performance models.
It employs a feedback-driven controller to select configurations.
PALS is implemented within the vLLM framework.
It requires no model retraining or API changes.
It was tested on multi-GPU systems with dense and MoE models.
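The feedback-driven part of the facts above can be illustrated with a simple step controller. This is a sketch under stated assumptions, not the controller described by the authors: the function name next_power_cap, the fixed step size, the headroom band, and the cap limits are all hypothetical. The idea shown is only that the cap is raised when measured throughput falls below the target and lowered when there is spare headroom.

```python
def next_power_cap(
    current_cap_w: int,
    measured_tps: float,
    target_tps: float,
    *,
    step_w: int = 10,        # assumed adjustment per control interval, in watts
    headroom: float = 0.05,  # assumed tolerance band above the target (5%)
    min_cap_w: int = 150,    # assumed lowest cap the hardware accepts
    max_cap_w: int = 300,    # assumed highest cap the hardware accepts
) -> int:
    """Return the power cap to apply for the next control interval."""
    if measured_tps < target_tps:
        # Throughput goal missed: give the GPU more power.
        proposed = current_cap_w + step_w
    elif measured_tps > target_tps * (1 + headroom):
        # Comfortably above the goal: try to save energy.
        proposed = current_cap_w - step_w
    else:
        # Inside the tolerance band: hold the current cap.
        proposed = current_cap_w
    # Clamp to the range the device supports.
    return max(min_cap_w, min(max_cap_w, proposed))


# Example: the target is missed at a 250 W cap, so the cap is raised one step.
print(next_power_cap(250, measured_tps=1800.0, target_tps=2000.0))  # prints 260
```

In a real deployment the returned value would be applied through the GPU vendor's management interface and the measurement would come from the serving engine's own metrics. Both are left out here to keep the sketch self-contained.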

Sources

arXiv cs.AI — 2026-05-21