UniScale: Unified Inference Scaling for LLMs via Joint Model Routing and Test-Time Optimization
A new paper on arXiv (2605.30898) introduces Unified Inference Scaling (UIS), a framework that jointly optimizes model routing and test-time scaling (TTS) for large language models (LLMs). Current approaches treat the two as separate dimensions: model routing switches among models of different scales based on request complexity, while TTS adjusts the compute spent within a single fixed model. Each has a weakness when used alone. Because the available model scales are sparse, routing can only change performance in coarse steps, and single-model TTS runs into a capacity ceiling and diminishing returns. UIS folds both mechanisms into a single optimization problem, so that model scale and compute budget are controlled together at a fine granularity. The stated goal is adaptive inference that balances quality against cost in dynamic deployment environments.
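The summary does not describe the paper's actual algorithm, so the following is only a toy sketch of what searching a joint routing and TTS space could look like. Everything in it is an assumption made for illustration: the model names, the cost and quality numbers, the `estimated_quality` curve, and the `choose` helper are hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass
from itertools import product
from typing import Optional


@dataclass(frozen=True)
class Config:
    model: str    # which model the request is routed to
    samples: int  # test-time scaling budget, e.g. number of sampled answers


# Hypothetical model pool (illustrative numbers, not from the paper).
# "ceiling" is the quality the model approaches as TTS compute grows.
MODELS = {
    "small": {"cost_per_sample": 1.0, "base": 0.50, "ceiling": 0.70},
    "large": {"cost_per_sample": 8.0, "base": 0.75, "ceiling": 0.90},
}
SAMPLE_BUDGETS = [1, 2, 4, 8]


def estimated_quality(cfg: Config) -> float:
    """Toy curve: quality rises with more samples but saturates at a ceiling."""
    m = MODELS[cfg.model]
    return m["ceiling"] - (m["ceiling"] - m["base"]) / cfg.samples


def cost(cfg: Config) -> float:
    """Toy cost: linear in the number of samples."""
    return MODELS[cfg.model]["cost_per_sample"] * cfg.samples


def choose(budget: float) -> Optional[Config]:
    """Pick the best (model, samples) pair whose cost fits the budget.

    Searching over pairs, instead of picking a model first and a TTS level
    second, is what makes the selection joint.
    """
    candidates = [Config(m, n) for m, n in product(MODELS, SAMPLE_BUDGETS)]
    feasible = [c for c in candidates if cost(c) <= budget]
    if not feasible:
        return None
    # Highest estimated quality wins; ties go to the cheaper configuration.
    return max(feasible, key=lambda c: (estimated_quality(c), -cost(c)))


if __name__ == "__main__":
    for budget in (4, 8, 20):
        print(budget, choose(budget))
```

With these made-up numbers, a budget of 8 buys either eight samples from the small model or one from the large model, and the joint search prefers the single large-model call because the small model has already flattened out near its ceiling. A real system would replace the fixed curves with per-request quality and cost estimates.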
Key facts
- Paper arXiv:2605.30898 introduces Unified Inference Scaling (UIS).
- UIS jointly optimizes model routing and test-time scaling (TTS).
- Existing approaches treat routing and TTS as independent dimensions.
- Model routing provides coarse-grained performance changes due to sparse model scales.
- Single-model TTS encounters capacity ceilings and diminishing returns.
- UIS aims to overcome the limitations of this decoupled design.
- The framework targets real-world LLM deployments balancing inference quality and cost.
- UIS enables adaptive inference in dynamic environments.
Entities
Institutions
- arXiv