POISE: Efficient RLVR for LLMs Using Internal State Value Estimation
A new reinforcement learning method, Policy Optimization with Internal State Value Estimation (POISE), reduces the computational cost of training large reasoning models. Unlike PPO, which requires a separate critic model, or GRPO, which needs multiple rollouts per prompt, POISE uses the policy model's own internal signals (hidden states and token-entropy statistics) to predict expected verifiable rewards. A lightweight probe, trained online, supplies these estimates as a baseline, and a cross-rollout construction keeps the resulting policy gradient unbiased. The approach promises variance reduction at negligible extra cost.
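The exact probe architecture is not given here; the PyTorch sketch below is a minimal illustration of the idea, assuming mean-pooled hidden states plus a few summary statistics of per-token entropy as features, regressed online against the observed verifiable reward. Class names, layer sizes, and feature choices are hypothetical, not the authors' implementation.

```python
import torch
import torch.nn as nn

class ValueProbe(nn.Module):
    """Lightweight probe mapping internal policy signals to a predicted
    verifiable reward. Architecture and features are illustrative
    assumptions, not the paper's implementation."""
    def __init__(self, hidden_dim: int, n_entropy_feats: int = 4):
        super().__init__()
        self.head = nn.Sequential(
            nn.Linear(hidden_dim + n_entropy_feats, 128),
            nn.ReLU(),
            nn.Linear(128, 1),
        )

    def forward(self, hidden_states: torch.Tensor, entropies: torch.Tensor) -> torch.Tensor:
        # hidden_states: (batch, seq, hidden_dim) from the policy model
        # entropies:     (batch, seq) per-token entropy of the policy
        pooled = hidden_states.mean(dim=1)            # (batch, hidden_dim)
        stats = torch.stack([                         # (batch, n_entropy_feats)
            entropies.mean(dim=1),
            entropies.std(dim=1),
            entropies.max(dim=1).values,
            entropies.min(dim=1).values,
        ], dim=1)
        return self.head(torch.cat([pooled, stats], dim=1)).squeeze(-1)

def probe_loss(probe: ValueProbe,
               hidden_states: torch.Tensor,
               entropies: torch.Tensor,
               rewards: torch.Tensor) -> torch.Tensor:
    """Online regression of the probe toward observed verifiable rewards."""
    return nn.functional.mse_loss(probe(hidden_states, entropies), rewards)
```

Because such a probe is tiny relative to the policy model, the extra compute it adds per step is minor, consistent with the "negligible extra cost" claim above.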
Key facts
- POISE stands for Policy Optimization with Internal State Value Estimation
- It uses the policy model's internal signals for baseline estimation
- Avoids the separate, policy-model-scale critic that PPO requires
- Avoids the multiple rollouts per prompt that GRPO requires
- A lightweight probe predicts expected verifiable reward from hidden states and token-entropy statistics
- A cross-rollout construction preserves gradient unbiasedness (see the sketch after this list)
- The method is designed for reinforcement learning with verifiable rewards (RLVR) on Large Reasoning Models
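On the cross-rollout point: a baseline computed from a rollout's own hidden states is correlated with that rollout's sampled tokens, which would bias a naive REINFORCE-style estimator. One plausible construction, sketched below, subtracts the probe's (detached) prediction for a different rollout in the batch, so the subtracted baseline is independent of the rollout it corrects. The cyclic pairing and function names here are assumptions; the paper's exact construction may differ.

```python
import torch

def poise_policy_loss(logprobs: torch.Tensor,
                      rewards: torch.Tensor,
                      probe_preds: torch.Tensor) -> torch.Tensor:
    """Baselined REINFORCE-style loss with a cross-rollout baseline.

    logprobs:    (batch,) summed log-probabilities of each sampled rollout
    rewards:     (batch,) verifiable rewards (e.g. 0/1 correctness)
    probe_preds: (batch,) probe predictions for each rollout

    NOTE: the cyclic pairing below is an illustrative assumption.
    """
    # Pair each rollout with its neighbor so the subtracted baseline
    # does not depend on the rollout's own sample.
    partner = torch.roll(torch.arange(rewards.numel()), shifts=1)
    advantages = rewards - probe_preds.detach()[partner]
    return -(advantages * logprobs).mean()
```

Since independently sampled rollouts are mutually independent, the subtracted term contributes zero to the gradient in expectation, so unbiasedness is preserved while whatever accuracy the probe has shows up as variance reduction.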
- The paper is available on arXiv with ID 2605.07579