POLAR Framework Optimizes LoRA Adapter Caching for Edge LLM Serving
A new research paper introduces POLAR (Paging and Online Learning for Adapter Routing), a system designed to optimize edge deployment of large language models that serve many LoRA adapters. The framework addresses the challenge that limited GPU/DRAM capacity can hold only a small subset of adapters at any given time; when a request requires a non-resident adapter, paging its weights in from storage introduces significant latency.

POLAR formulates joint caching and routing as a two-timescale contextual bandit problem, combining a cache-aware LinUCB router with an epoch-based cache controller. The system manages adapter residency on a slow timescale while routing requests on a fast timescale, where each adapter's utility depends on unknown contextual factors. This joint design lets the cache state shape exploration costs while the router determines which adapters receive feedback. The paper was published on arXiv under identifier 2604.16583v1 as a cross-listed announcement.
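The two-timescale structure can be illustrated with a minimal sketch: standard LinUCB scoring per adapter, a latency penalty subtracted from non-resident adapters' scores, and a cache refresh every epoch. The class name, the additive penalty form, and the usage-based eviction rule below are illustrative assumptions, not details from the paper.

```python
import numpy as np

class CacheAwareLinUCB:
    """Hypothetical sketch of POLAR-style routing: one LinUCB arm per adapter,
    a paging penalty for non-resident adapters (fast timescale), and an
    epoch-based cache refresh (slow timescale)."""

    def __init__(self, n_adapters, dim, cache_size,
                 alpha=1.0, load_penalty=0.5, epoch_len=20):
        self.A = [np.eye(dim) for _ in range(n_adapters)]    # per-arm Gram matrices
        self.b = [np.zeros(dim) for _ in range(n_adapters)]  # per-arm reward sums
        self.alpha = alpha                # exploration width
        self.load_penalty = load_penalty  # assumed cost of paging weights in
        self.cache = set(range(cache_size))  # IDs of resident adapters
        self.cache_size = cache_size
        self.epoch_len = epoch_len
        self.usage = np.zeros(n_adapters)
        self.t = 0

    def route(self, x):
        """Fast timescale: pick the adapter maximizing the cache-aware UCB."""
        scores = []
        for a in range(len(self.A)):
            A_inv = np.linalg.inv(self.A[a])
            theta = A_inv @ self.b[a]
            ucb = theta @ x + self.alpha * np.sqrt(x @ A_inv @ x)
            if a not in self.cache:
                ucb -= self.load_penalty  # charge expected paging latency
            scores.append(ucb)
        return int(np.argmax(scores))

    def update(self, a, x, reward):
        """Standard LinUCB update for the chosen arm, plus epoch-end cache control."""
        self.A[a] += np.outer(x, x)
        self.b[a] += reward * x
        self.usage[a] += 1
        self.t += 1
        if self.t % self.epoch_len == 0:
            # Slow timescale: keep the most-used adapters resident next epoch.
            top = np.argsort(-self.usage)[: self.cache_size]
            self.cache = set(int(i) for i in top)
            self.usage[:] = 0
```

Because only routed adapters receive reward feedback, the cache contents (via the penalty) directly shape which arms get explored, which is the coupling the paper's joint formulation captures.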
Key facts
- POLAR addresses LoRA adapter caching in edge LLM serving
- Limited GPU/DRAM capacity restricts resident adapter subsets
- Non-resident adapters require weight paging from storage
- System formulates caching and routing as two-timescale contextual bandit
- Combines cache-aware LinUCB router with epoch-based cache controller
- Slow timescale manages adapter residency in fast memory
- Fast timescale routes requests to context-dependent adapters
- Research published on arXiv as 2604.16583v1
Entities
Institutions
- arXiv