GAR: Carbon-Aware LLM Routing via Constrained Optimization
A new framework called Green-Aware Routing (GAR) minimizes CO2 emissions per LLM inference request while maintaining accuracy and latency targets. GAR uses adaptive constraint optimization and lightweight estimators for real-time routing decisions across heterogeneous model pools. The paper introduces GAR-PD, a practical online primal-dual routing algorithm.
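One way to write down the kind of problem described above is as a constrained program over a routing policy. The notation here is an assumption made for illustration and is not taken from the paper: \pi maps a request x to a model, c is the per-request carbon cost, a is a correctness indicator, \ell is latency, \alpha is the accuracy floor, and L_{\max} is the p95-latency SLO.

```latex
\begin{aligned}
\min_{\pi}\quad & \mathbb{E}_{x}\big[\, c(x, \pi(x)) \,\big] \\
\text{s.t.}\quad & \mathbb{E}_{x}\big[\, a(x, \pi(x)) \,\big] \;\ge\; \alpha, \\
& Q_{0.95}\big(\ell(x, \pi(x))\big) \;\le\; L_{\max}, \\
\text{where}\quad & c(x, m) \;=\; e(x, m)\cdot I(t, r).
\end{aligned}
```

In this sketch e(x, m) is the energy model m spends on request x (in kWh) and I(t, r) is the grid carbon intensity at time t in region r (in gCO2/kWh), so their product is the emitted CO2 in grams. The paper's exact objective and constraint forms may differ.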
Key facts
- GAR is a constrained multi-objective optimization framework for LLM inference routing.
- It minimizes per-request CO2 emissions subject to accuracy floors and p95-latency SLOs.
- GAR employs per-dataset floor tuning and lightweight estimators for correctness, tail latency, and carbon emissions.
- GAR-PD is a practical online primal-dual routing algorithm.
- Current routing methods rarely consider sustainable energy use and CO2 emissions.
- Grid carbon intensity varies by time and region.
- Models differ significantly in energy consumption.
- The paper is available on arXiv under the identifier 2605.11603.
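The facts above describe routing under an accuracy floor and a latency SLO with an online primal-dual algorithm. The following is a minimal sketch of a generic online primal-dual router, not the paper's GAR-PD: the class names, fields, step size, and update rule are all hypothetical, the per-model estimates are assumed to come from some external estimator, and the p95-latency constraint is simplified to a per-request bound on the estimated tail latency.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ModelEstimate:
    """Hypothetical per-request estimates for one candidate model."""
    name: str
    p_correct: float   # estimated probability of a correct answer
    latency_s: float   # estimated tail latency in seconds
    energy_kwh: float  # estimated energy used for this request


class PrimalDualRouter:
    """Generic online primal-dual router (illustrative, not GAR-PD)."""

    def __init__(self, acc_floor: float, latency_slo_s: float, step: float = 0.05):
        self.acc_floor = acc_floor
        self.latency_slo_s = latency_slo_s
        self.step = step
        self.lam_acc = 0.0  # dual variable for the accuracy floor
        self.lam_lat = 0.0  # dual variable for the latency SLO

    def route(self, candidates: List[ModelEstimate], grid_g_per_kwh: float) -> ModelEstimate:
        def lagrangian(m: ModelEstimate) -> float:
            # Carbon in grams = energy (kWh) * grid intensity (gCO2/kWh).
            carbon_g = m.energy_kwh * grid_g_per_kwh
            return (carbon_g
                    + self.lam_acc * (self.acc_floor - m.p_correct)
                    + self.lam_lat * (m.latency_s - self.latency_slo_s))

        # Primal step: pick the model with the lowest penalized carbon cost.
        chosen = min(candidates, key=lagrangian)

        # Dual step: raise a multiplier when its constraint is violated,
        # lower it (never below zero) when there is slack.
        self.lam_acc = max(0.0, self.lam_acc
                           + self.step * (self.acc_floor - chosen.p_correct))
        self.lam_lat = max(0.0, self.lam_lat
                           + self.step * (chosen.latency_s - self.latency_slo_s))
        return chosen
```

Calling `route` once per request with the current grid intensity lets the multipliers adapt over time: when the cheap, low-carbon model keeps falling short of the accuracy floor, `lam_acc` grows until a more accurate model becomes the cheaper choice under the penalized cost.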
Entities
Institutions
- arXiv