HRBench: A Unified Framework for Evaluating Thinking-Mode Switching in Hybrid-Reasoning LLMs
Researchers have introduced HRBench, a unified evaluation framework for thinking-mode switching in hybrid-reasoning large language models (LLMs). These models let users control reasoning effort, trading answer quality against inference cost. HRBench organizes the design space along two dimensions: three families of switching strategies (prompt-based selection, external routing, speculative execution) and four training regimes (training-free, SFT, offline RL, online RL), which together give 3 × 4 = 12 controlled evaluation settings. The framework evaluates these settings on six LLMs, ranging from Qwen3.5-2B to Kimi-K2.5-1.1T, and on five reasoning benchmarks spanning mathematics, science, and code. It also reimplements more than 12 representative methods. The aim is to standardize comparison of adaptive thinking-mode selection strategies, which have previously been evaluated inconsistently.
Key facts
- HRBench is a unified evaluation framework for thinking-mode switching in hybrid-reasoning LLMs.
- The framework covers three switching strategy families: prompt-based selection, external routing, and speculative execution.
- It includes four training regimes: training-free, SFT, offline RL, and online RL.
- Combining the three strategy families with the four training regimes yields 12 controlled evaluation settings.
- Evaluations are conducted across six LLMs, from Qwen3.5-2B to Kimi-K2.5-1.1T.
- Five reasoning benchmarks are used, covering mathematics, science, and code.
- Over 12 representative methods are reimplemented for comparison.
- The goal is to standardize comparison of adaptive thinking-mode selection strategies.
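The 12 settings are the cross product of the three strategy families and the four training regimes listed above. The following is a minimal sketch of that enumeration, not code from HRBench itself: the constant names `STRATEGIES` and `REGIMES` and the function `evaluation_settings` are hypothetical, and only the labels come from the summary.

```python
from itertools import product

# Labels are taken from the summary; the identifiers are illustrative
# and are not part of any published HRBench API.
STRATEGIES = ("prompt-based selection", "external routing", "speculative execution")
REGIMES = ("training-free", "SFT", "offline RL", "online RL")


def evaluation_settings():
    """Return every (strategy, regime) pair as a list of tuples."""
    return list(product(STRATEGIES, REGIMES))


if __name__ == "__main__":
    settings = evaluation_settings()
    for index, (strategy, regime) in enumerate(settings, start=1):
        print(f"{index:2d}. {strategy} + {regime}")
    print(f"total settings: {len(settings)}")
```

Each pair is one controlled setting, so a method is compared only against others that use the same strategy family and the same training regime.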