POISE: Efficient RLVR for LLMs Using Internal State Value Estimation
A new reinforcement learning method, Policy Optimization with Internal State Value Estimation (POISE), reduces the computational cost of training large reasoning models. Unlike PPO, which requires a separate critic model, or GRPO, which needs multiple rollouts per prompt, POISE uses the policy model's own internal signals (hidden states and token-entropy statistics) to predict expected verifiable rewards. A lightweight probe, trained online, supplies these estimates as a baseline, and a cross-rollout construction keeps the resulting policy gradient unbiased. The approach promises variance reduction at negligible extra cost.
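The exact probe architecture is not given here; the PyTorch sketch below is a minimal illustration of the idea, assuming mean-pooled hidden states plus a few summary statistics of per-token entropy as features, regressed online against the observed verifiable reward. Class names, layer sizes, and feature choices are hypothetical, not the authors' implementation.

```python
import torch
import torch.nn as nn

class ValueProbe(nn.Module):
    """Lightweight probe mapping internal policy signals to a predicted
    verifiable reward. Architecture and features are illustrative
    assumptions, not the paper's implementation."""
    def __init__(self, hidden_dim: int, n_entropy_feats: int = 4):
        super().__init__()
        self.head = nn.Sequential(
            nn.Linear(hidden_dim + n_entropy_feats, 128),
            nn.ReLU(),
            nn.Linear(128, 1),
        )

    def forward(self, hidden_states: torch.Tensor, entropies: torch.Tensor) -> torch.Tensor:
        # hidden_states: (batch, seq, hidden_dim) from the policy model
        # entropies:     (batch, seq) per-token entropy of the policy
        pooled = hidden_states.mean(dim=1)            # (batch, hidden_dim)
        stats = torch.stack([                         # (batch, n_entropy_feats)
            entropies.mean(dim=1),
            entropies.std(dim=1),
            entropies.max(dim=1).values,
            entropies.min(dim=1).values,
        ], dim=1)
        return self.head(torch.cat([pooled, stats], dim=1)).squeeze(-1)

def probe_loss(probe: ValueProbe,
               hidden_states: torch.Tensor,
               entropies: torch.Tensor,
               rewards: torch.Tensor) -> torch.Tensor:
    """Online regression of the probe toward observed verifiable rewards."""
    return nn.functional.mse_loss(probe(hidden_states, entropies), rewards)
```

Because such a probe is tiny relative to the policy model, the extra compute it adds per step is minor, consistent with the "negligible extra cost" claim above.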
Key facts
- POISE stands for Policy Optimization with Internal State Value Estimation
- It uses the policy model's internal signals for baseline estimation
- Avoids the separate, policy-model-scale critic that PPO requires
- Avoids the multiple rollouts per prompt that GRPO requires
- A lightweight probe predicts expected verifiable reward from hidden states and token-entropy statistics
- A cross-rollout construction preserves gradient unbiasedness (see the sketch after this list)
- The method is designed for reinforcement learning with verifiable rewards (RLVR) on Large Reasoning Models
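On the cross-rollout point: a baseline computed from a rollout's own hidden states is correlated with that rollout's sampled tokens, which would bias a naive REINFORCE-style estimator. One plausible construction, sketched below, subtracts the probe's (detached) prediction for a different rollout in the batch, so the subtracted baseline is independent of the rollout it corrects. The cyclic pairing and function names here are assumptions; the paper's exact construction may differ.

```python
import torch

def poise_policy_loss(logprobs: torch.Tensor,
                      rewards: torch.Tensor,
                      probe_preds: torch.Tensor) -> torch.Tensor:
    """Baselined REINFORCE-style loss with a cross-rollout baseline.

    logprobs:    (batch,) summed log-probabilities of each sampled rollout
    rewards:     (batch,) verifiable rewards (e.g. 0/1 correctness)
    probe_preds: (batch,) probe predictions for each rollout

    NOTE: the cyclic pairing below is an illustrative assumption.
    """
    # Pair each rollout with its neighbor so the subtracted baseline
    # does not depend on the rollout's own sample.
    partner = torch.roll(torch.arange(rewards.numel()), shifts=1)
    advantages = rewards - probe_preds.detach()[partner]
    return -(advantages * logprobs).mean()
```

Since independently sampled rollouts are mutually independent, the subtracted term contributes zero to the gradient in expectation, so unbiasedness is preserved while whatever accuracy the probe has shows up as variance reduction.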
- The paper is available on arXiv with ID 2605.07579