Reinforcement Learning Optimizes LLM Stepwise Model Routing for Cost-Efficiency
A new arXiv preprint (2605.06116) introduces a reinforcement learning approach to stepwise model routing in large language models (LLMs), balancing reasoning accuracy against inference cost. The method trains a small control policy with RL plus threshold calibration, treating routing as a constrained decision-making problem. On the math benchmarks GSM8K, MATH500, and OmniMath it outperforms handcrafted routing strategies, achieving comparable accuracy-cost tradeoffs across both open and closed models.
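To make the idea concrete, here is a minimal Python sketch of what stepwise routing with a small policy and a calibrated threshold could look like. Everything in it is an assumption for illustration: the model stubs, the logistic policy, the confidence-based difficulty feature, and the 10:1 cost ratio are not taken from the preprint.

```python
import math
import random
from dataclasses import dataclass


def small_model(prompt: str) -> tuple[str, float]:
    """Stub for a cheap model: returns a reasoning step and a confidence."""
    return "cheap-model step", random.uniform(0.2, 1.0)


def large_model(prompt: str) -> str:
    """Stub for an expensive model, assumed more reliable per step."""
    return "strong-model step"


@dataclass
class RoutingPolicy:
    """Tiny control policy. In the paper's setting its parameters would be
    trained with RL; the values here are arbitrary. `threshold` is the knob
    calibrated separately against a cost budget."""
    weight: float = 4.0
    bias: float = -2.0
    threshold: float = 0.5

    def escalate_prob(self, difficulty: float) -> float:
        # Logistic score: higher perceived difficulty -> higher chance the
        # next step is routed to the expensive model.
        return 1.0 / (1.0 + math.exp(-(self.weight * difficulty + self.bias)))


def solve_stepwise(problem: str, policy: RoutingPolicy, max_steps: int = 5):
    """Route each reasoning step independently, tracking relative cost."""
    trace, total_cost = [], 0.0
    for _ in range(max_steps):
        draft, confidence = small_model(problem)
        difficulty = 1.0 - confidence  # crude stand-in for a learned state feature
        if policy.escalate_prob(difficulty) >= policy.threshold:
            trace.append(large_model(problem))
            total_cost += 10.0  # assumed large/small cost ratio of 10:1
        else:
            trace.append(draft)
            total_cost += 1.0
    return trace, total_cost


steps, cost = solve_stepwise("If 3x + 5 = 17, what is x?", RoutingPolicy())
print(f"{len(steps)} steps, relative cost {cost:.0f}")
```

The point of the sketch is the decision granularity: routing happens per step rather than per query, so easy steps of a hard problem can still go to the cheap model.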
Key facts
- arXiv preprint 2605.06116 proposes policy-guided stepwise model routing for LLMs.
- Method uses reinforcement learning and threshold calibration to optimize cost-efficiency (a toy calibration sketch follows this list).
- Validated on GSM8K, MATH500, and OmniMath benchmarks.
- Outperforms handcrafted routing strategies.
- Applicable to both open and closed models.
- Formulates routing as a constrained decision-making problem.
- Avoids training large process reward models.
- Focuses on inference-time computation for reasoning tasks.
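The constrained formulation suggests one simple reading of threshold calibration: sweep candidate thresholds on validation data and keep the most accurate one whose average cost stays within a budget. The sketch below assumes hypothetical validation records carrying a per-step escalation score and per-model correctness labels; the data structure, costs, and procedure are illustrative and may differ from the preprint's.

```python
import random
from dataclasses import dataclass


@dataclass
class ValRecord:
    """One validation step: the policy's escalation score, plus whether the
    cheap and expensive models would each have gotten the step right.
    (Fabricated structure for illustration.)"""
    score: float
    small_ok: bool
    large_ok: bool


def calibrate_threshold(records: list[ValRecord], cost_budget: float,
                        large_cost: float = 10.0, small_cost: float = 1.0) -> float:
    """Pick the threshold maximizing accuracy subject to the cost budget."""
    best_t, best_acc = 1.0, -1.0  # threshold 1.0 == never escalate
    for t in [i / 100 for i in range(101)]:
        acc = cost = 0.0
        for r in records:
            if r.score >= t:  # this step would be escalated
                acc += r.large_ok
                cost += large_cost
            else:
                acc += r.small_ok
                cost += small_cost
        acc, cost = acc / len(records), cost / len(records)
        if cost <= cost_budget and acc > best_acc:
            best_t, best_acc = t, acc
    return best_t


# Toy usage with synthetic records: high-score (hard-looking) steps are the
# ones the small model tends to miss.
random.seed(0)
records = []
for _ in range(1000):
    s = random.random()  # policy escalation score for this step
    records.append(ValRecord(score=s, small_ok=s < 0.6, large_ok=random.random() < 0.9))
print(f"calibrated threshold: {calibrate_threshold(records, cost_budget=4.0):.2f}")
```

Because calibration only moves a scalar, the budget can be retuned after deployment without retraining the policy itself.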