ReProbe: Efficient Test-Time Scaling for Multi-Step Reasoning via Internal State Probing
Researchers have introduced ReProbe, an efficient technique for verifying reasoning steps in large language models (LLMs) by examining their internal states. In contrast to Process Reward Models (PRMs), which are resource-intensive and require extensive annotations, ReProbe uses a transformer-based probe with fewer than 10 million parameters. The probe assesses the reliability of each reasoning step during generation, using internal states from a frozen LLM. Annotations can be generated by a larger LLM, such as DeepSeek-R1, or through self-supervision by the original model. This enables efficient test-time scaling (TTS): the model samples multiple candidate reasoning steps, and the probe selects the most reliable one to continue from. The approach has been evaluated across several domains, showing improved reasoning without the heavy annotation and compute requirements of PRMs. The paper is available on arXiv with ID 2511.06209.
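To make the probing idea concrete, here is a minimal sketch of the interface such a step verifier might expose: given the frozen LLM's per-token hidden states for one reasoning step, return a reliability score. The paper describes a small transformer probe; for brevity this sketch substitutes mean pooling plus a logistic scorer, and all names, shapes, and the pooling choice are illustrative assumptions, not the authors' implementation.

```python
import math
import random

def score_step(hidden_states, w, b):
    """Score one reasoning step's reliability from its token hidden states.

    Stand-in for ReProbe's sub-10M-parameter transformer probe: here we
    mean-pool the frozen LLM's per-token hidden states (hypothetical
    d-dimensional vectors) and apply a learned logistic scorer.
    """
    d = len(w)
    # Mean-pool token vectors into one step-level vector.
    pooled = [sum(tok[i] for tok in hidden_states) / len(hidden_states)
              for i in range(d)]
    # Linear scorer followed by a sigmoid -> reliability in (0, 1).
    logit = sum(wi * xi for wi, xi in zip(w, pooled)) + b
    return 1.0 / (1.0 + math.exp(-logit))

# Toy usage: a 4-token step with hidden size 8 and random probe weights.
random.seed(0)
d = 8
states = [[random.gauss(0, 1) for _ in range(d)] for _ in range(4)]
w = [random.gauss(0, 1) for _ in range(d)]
score = score_step(states, w, b=0.0)
```

In the real method the scorer's weights would be trained on step-level labels produced by a larger LLM or by self-supervision, while the underlying LLM stays frozen.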
Key facts
- ReProbe uses internal states of LLMs for step-level reasoning verification.
- The probe is a transformer-based model with fewer than 10 million parameters.
- Annotations can come from a larger LLM like DeepSeek-R1 or be self-supervised.
- It is a lightweight alternative to Process Reward Models (PRMs).
- Test-time scaling improves performance by sampling and verifying reasoning steps.
- The method is evaluated across multiple domains.
- The paper is available on arXiv: 2511.06209.
- The approach uses a frozen LLM for probing.
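The test-time scaling loop implied by the facts above can be sketched as a simple best-of-n selection at each step: sample several candidate continuations from the frozen LLM, score each with the probe, and continue from the highest-scoring one. The function names and the stub sampler/scorer below are assumptions for illustration, not the paper's API.

```python
def best_of_n_step(sample_step, score_step, n=4):
    """One round of step-level test-time scaling.

    Samples n candidate reasoning steps (sample_step), scores each with
    the lightweight probe (score_step), and keeps the most reliable one.
    Both callables are hypothetical stand-ins for the LLM and ReProbe.
    """
    candidates = [sample_step() for _ in range(n)]
    scored = [(score_step(step), step) for step in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[0][1]

# Toy usage with stubs: the "scorer" simply prefers shorter steps.
steps = ["step xxx", "step x", "step xx", "step xxxx"]
it = iter(steps)
best = best_of_n_step(lambda: next(it), lambda s: -len(s), n=4)
# best == "step x" (the highest-scoring, i.e. shortest, candidate)
```

Because the probe is tiny relative to the LLM, the extra cost per step is dominated by sampling the candidates themselves, which is what makes this form of test-time scaling cheap compared with running a full PRM.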
Entities
Institutions
- arXiv