ARTFEED — Contemporary Art Intelligence

POISE: Efficient RLVR for LLMs Using Internal State Value Estimation

ai-technology · 2026-05-11

A new reinforcement learning method called Policy Optimization with Internal State Value Estimation (POISE) reduces the computational cost of training large reasoning models. Unlike PPO, which requires a separate critic model, or GRPO, which needs multiple rollouts per prompt, POISE uses the policy model's own internal signals—hidden states and token-entropy statistics—to predict expected verifiable rewards. A lightweight probe trained online estimates these values, and a cross-rollout construction preserves gradient unbiasedness. The approach promises variance reduction at negligible extra cost.
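The article gives no implementation details, but the idea of a lightweight probe trained online can be illustrated with a minimal sketch. Everything below is an assumption for illustration: a linear head over a final hidden-state vector plus a mean token-entropy feature, updated by one SGD step per rollout toward the observed verifiable reward. The class name, feature choice, and update rule are hypothetical, not the paper's recipe.

```python
import numpy as np

# Hypothetical lightweight probe: a linear head over the policy's hidden
# state plus one token-entropy statistic, trained online to predict the
# verifiable reward of a rollout. Names and shapes are assumptions.
class ValueProbe:
    def __init__(self, hidden_dim: int, lr: float = 0.01, seed: int = 0):
        rng = np.random.default_rng(seed)
        # +1 input dimension for the mean token-entropy feature
        self.w = rng.normal(scale=0.01, size=hidden_dim + 1)
        self.b = 0.0
        self.lr = lr

    def features(self, hidden_state: np.ndarray,
                 token_entropies: np.ndarray) -> np.ndarray:
        # Concatenate the hidden state with a scalar entropy summary
        return np.concatenate([hidden_state, [token_entropies.mean()]])

    def predict(self, hidden_state, token_entropies) -> float:
        return float(self.features(hidden_state, token_entropies) @ self.w
                     + self.b)

    def update(self, hidden_state, token_entropies, reward: float) -> float:
        """One online SGD step on squared error; returns the loss."""
        x = self.features(hidden_state, token_entropies)
        err = float(x @ self.w + self.b) - reward
        self.w -= self.lr * err * x
        self.b -= self.lr * err
        return 0.5 * err * err
```

Because the probe is a single linear layer, both prediction and the online update cost a handful of dot products per rollout, which is consistent with the article's "negligible extra cost" claim.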

Key facts

  • POISE stands for Policy Optimization with Internal State Value Estimation
  • It uses the policy model's internal signals for baseline estimation
  • Avoids the policy-model-scale critic that PPO requires
  • Avoids the multiple rollouts per prompt that GRPO requires
  • A lightweight probe predicts expected verifiable reward from hidden states and token-entropy statistics
  • Cross-rollout construction ensures gradient unbiasedness
  • The method is designed for reinforcement learning with verifiable rewards (RLVR) for Large Reasoning Models
  • The paper is available on arXiv with ID 2605.07579
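The "cross-rollout construction" above is not specified in the article; one standard way such a construction preserves unbiasedness is a leave-one-out scheme, sketched below purely as an assumption. The baseline for rollout i is fit only on the other rollouts in the batch, so it is statistically independent of rollout i's reward, which is the classic sufficient condition for a baseline to leave the policy-gradient estimate unbiased. The function name and the ridge-regularized fit are illustrative choices.

```python
import numpy as np

# Hedged sketch of a cross-rollout baseline (leave-one-out variant, an
# assumption rather than the paper's exact recipe): the baseline for
# rollout i is predicted from a probe fit on the OTHER rollouts only,
# so it is independent of rewards[i] and the gradient stays unbiased.
def cross_rollout_advantages(features: np.ndarray,
                             rewards: np.ndarray) -> np.ndarray:
    n = len(rewards)
    advantages = np.empty(n)
    for i in range(n):
        mask = np.arange(n) != i
        X, y = features[mask], rewards[mask]
        # Ridge-regularized least squares on the held-out rollouts
        A = X.T @ X + 1e-3 * np.eye(X.shape[1])
        w = np.linalg.solve(A, X.T @ y)
        baseline = features[i] @ w  # independent of rewards[i]
        advantages[i] = rewards[i] - baseline
    return advantages
```

When rewards are well predicted by the internal-state features, the resulting advantages have much lower variance than raw rewards, which is the variance-reduction effect the article describes.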

Entities

Institutions

  • arXiv
