POLAR Framework Optimizes LoRA Adapter Caching for Edge LLM Serving
A new research paper introduces POLAR (Paging and Online Learning for Adapter Routing), a system designed to optimize edge deployment of large language models that serve many LoRA adapters. The framework addresses the challenge that limited GPU/DRAM capacity can hold only a small subset of adapters at any given time; when a request requires a non-resident adapter, paging its weights in from storage introduces significant latency.

POLAR formulates joint caching and routing as a two-timescale contextual bandit problem, combining a cache-aware LinUCB router with an epoch-based cache controller. The system manages adapter residency on a slow timescale while routing requests on a fast timescale, where each adapter's utility depends on unknown contextual factors. This joint design lets the cache state shape exploration costs while the router determines which adapters receive feedback. The paper was published on arXiv under identifier 2604.16583v1 as a cross-listed announcement.
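The two-timescale structure can be illustrated with a minimal sketch: standard LinUCB scoring per adapter, a latency penalty subtracted from non-resident adapters' scores, and a cache refresh every epoch. The class name, the additive penalty form, and the usage-based eviction rule below are illustrative assumptions, not details from the paper.

```python
import numpy as np

class CacheAwareLinUCB:
    """Hypothetical sketch of POLAR-style routing: one LinUCB arm per adapter,
    a paging penalty for non-resident adapters (fast timescale), and an
    epoch-based cache refresh (slow timescale)."""

    def __init__(self, n_adapters, dim, cache_size,
                 alpha=1.0, load_penalty=0.5, epoch_len=20):
        self.A = [np.eye(dim) for _ in range(n_adapters)]    # per-arm Gram matrices
        self.b = [np.zeros(dim) for _ in range(n_adapters)]  # per-arm reward sums
        self.alpha = alpha                # exploration width
        self.load_penalty = load_penalty  # assumed cost of paging weights in
        self.cache = set(range(cache_size))  # IDs of resident adapters
        self.cache_size = cache_size
        self.epoch_len = epoch_len
        self.usage = np.zeros(n_adapters)
        self.t = 0

    def route(self, x):
        """Fast timescale: pick the adapter maximizing the cache-aware UCB."""
        scores = []
        for a in range(len(self.A)):
            A_inv = np.linalg.inv(self.A[a])
            theta = A_inv @ self.b[a]
            ucb = theta @ x + self.alpha * np.sqrt(x @ A_inv @ x)
            if a not in self.cache:
                ucb -= self.load_penalty  # charge expected paging latency
            scores.append(ucb)
        return int(np.argmax(scores))

    def update(self, a, x, reward):
        """Standard LinUCB update for the chosen arm, plus epoch-end cache control."""
        self.A[a] += np.outer(x, x)
        self.b[a] += reward * x
        self.usage[a] += 1
        self.t += 1
        if self.t % self.epoch_len == 0:
            # Slow timescale: keep the most-used adapters resident next epoch.
            top = np.argsort(-self.usage)[: self.cache_size]
            self.cache = set(int(i) for i in top)
            self.usage[:] = 0
```

Because only routed adapters receive reward feedback, the cache contents (via the penalty) directly shape which arms get explored, which is the coupling the paper's joint formulation captures.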
Key facts
- POLAR addresses LoRA adapter caching in edge LLM serving
- Limited GPU/DRAM capacity restricts resident adapter subsets
- Non-resident adapters require weight paging from storage
- System formulates caching and routing as two-timescale contextual bandit
- Combines cache-aware LinUCB router with epoch-based cache controller
- Slow timescale manages adapter residency in fast memory
- Fast timescale routes requests to context-dependent adapters
- Research published on arXiv as 2604.16583v1
Entities
Institutions
- arXiv