UniScale: Unified Inference Scaling for LLMs via Joint Model Routing and Test-Time Optimization

ai-technology · 2026-06-01

A new paper on arXiv (2605.30898) introduces Unified Inference Scaling (UIS), a framework that jointly optimizes model routing and test-time scaling (TTS) for large language models (LLMs). Existing approaches treat the two as independent dimensions: model routing switches among models of different scales based on request complexity, while TTS adjusts the compute spent within a single fixed model. Decoupling them has a cost on each side. Because the available model scales are sparse, routing can only change performance in coarse steps, and TTS on a single model runs into that model's capacity ceiling, with diminishing returns from additional compute. UIS folds both mechanisms into a single optimization problem, so that model scale and compute budget are chosen together. The stated goal is adaptive inference that balances quality and cost at a finer granularity in dynamic deployment environments.

Key facts

Paper arXiv:2605.30898 introduces Unified Inference Scaling (UIS).
UIS jointly optimizes model routing and test-time scaling (TTS).
Existing approaches treat routing and TTS as independent dimensions.
Model routing provides coarse-grained performance changes due to sparse model scales.
Single-model TTS encounters capacity ceilings and diminishing returns.
UIS aims to overcome the limitations of this decoupled design.
The framework targets real-world LLM deployments balancing inference quality and cost.
UIS enables adaptive inference in dynamic environments.
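
To make the idea of a joint search concrete, here is a minimal Python sketch. It is not the paper's algorithm, which this summary does not describe. It assumes TTS takes the form of best-of-n sampling, gives each model a hypothetical per-sample success rate, per-sample cost and quality ceiling, and exhaustively picks the (model, sample count) pair with the highest estimated quality under a cost budget. All names and numbers (`MODELS`, `SAMPLE_COUNTS`, `enumerate_configs`, `best_under_budget`) are illustrative assumptions.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class Config:
    model: str      # which model the request is routed to
    samples: int    # test-time scaling knob: number of sampled answers
    quality: float  # estimated chance of a correct final answer
    cost: float     # estimated cost in arbitrary units


# Hypothetical numbers, NOT from the paper:
# model name -> (per-sample success rate, cost per sample, quality ceiling)
MODELS = {
    "small": (0.40, 1.0, 0.85),
    "large": (0.70, 6.0, 0.98),
}
SAMPLE_COUNTS = (1, 2, 4, 8)


def enumerate_configs():
    """Build the joint search space: every (model, sample count) pair."""
    configs = []
    for (name, (p, unit_cost, ceiling)), n in product(MODELS.items(), SAMPLE_COUNTS):
        # Assumed best-of-n model: quality grows with n but saturates
        # at the model's ceiling, mimicking diminishing returns.
        quality = ceiling * (1.0 - (1.0 - p) ** n)
        configs.append(Config(name, n, quality, unit_cost * n))
    return configs


def best_under_budget(configs, budget):
    """Return the highest-quality config whose cost fits the budget, or None."""
    feasible = [c for c in configs if c.cost <= budget]
    if not feasible:
        return None
    # Prefer higher quality; break ties by lower cost.
    return max(feasible, key=lambda c: (c.quality, -c.cost))


if __name__ == "__main__":
    configs = enumerate_configs()
    for budget in (2.0, 8.0, 12.0, 24.0):
        choice = best_under_budget(configs, budget)
        print(budget, choice.model, choice.samples)
```

With these made-up numbers, a budget of 8 favors the small model with eight samples over a single call to the large model, while a budget of 12 favors the large model with two samples. Routing alone (one sample per model) or TTS alone (small model only) would miss one of those choices, which is the kind of gap a joint formulation is meant to close.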
