ARTFEED — Contemporary Art Intelligence

CR²: Cost-Aware Routing for Edge LLM Inference

ai-technology · 2026-05-13

CR² (Cost-Aware Risk-Controlled Routing) is a new framework for deploying large language models (LLMs) on mobile edge devices. As LLMs move from centralized cloud systems to edge environments, efficient serving must balance latency, energy consumption, and accuracy. CR² splits routing into two stages: a lightweight margin gate on the device and an edge-side utility selector that handles deferred queries. The margin gate uses frozen query embeddings and a user-specified cost weight to decide whether a query should be answered locally. Unlike existing routers, which are designed for centralized cloud settings, this approach captures the dynamic latency and energy overheads of wireless edge deployments.
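The two-stage decision can be sketched roughly as follows. This is an illustrative assumption of how such a router might look, not the paper's actual implementation: the linear probe (`w`, `b`), the cost model, and all names are hypothetical.

```python
import numpy as np

def margin_gate(query_emb: np.ndarray, w: np.ndarray, b: float,
                cost_weight: float, local_cost: float) -> bool:
    """Stage 1 (on-device): decide whether to answer locally.

    query_emb   -- frozen (pre-computed, not fine-tuned) query embedding
    w, b        -- a lightweight linear probe predicting the local model's
                   confidence margin on this query (hypothetical)
    cost_weight -- user-specified weight trading accuracy against cost
    local_cost  -- estimated latency/energy cost of local inference
    """
    predicted_margin = float(w @ query_emb + b)
    # Serve locally when the predicted margin outweighs the weighted cost;
    # otherwise defer the query to the edge.
    return predicted_margin - cost_weight * local_cost > 0.0

def edge_selector(costs: dict[str, float],
                  utilities: dict[str, float],
                  cost_weight: float) -> str:
    """Stage 2 (edge-side): for a deferred query, pick the serving option
    that maximizes net utility (expected quality minus weighted cost)."""
    return max(utilities,
               key=lambda m: utilities[m] - cost_weight * costs[m])
```

Decoupling the two stages means the device never needs the edge's cost table: it only applies its local gate, and the edge selector re-evaluates deferred queries under current wireless conditions.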

Key facts

  • CR² is a two-stage device-edge routing framework for LLM inference.
  • It decouples an on-device margin gate from an edge-side utility selector.
  • The margin gate uses frozen query embeddings and a user-specified cost weight.
  • Existing routers are designed for centralized cloud settings.
  • CR² captures dynamic latency and energy overheads in wireless edge deployments.
  • The paper formulates mobile edge LLM routing as a deployment-constrained, cost-aware decision problem.
  • LLMs are moving from centralized clouds to mobile edge environments.
  • Efficient serving must balance latency, energy consumption, and accuracy.
