PAIR: Prefix-Aware Internal Reward Model for Multi-Turn Agent Optimization

ai-technology · 2026-05-20

A new paper on arXiv (2605.17877) introduces PAIR, a method that repurposes internal correctness probing over LLM hidden states as a step-level reward signal for multi-turn agent optimization. Current LLMs struggle with complex multi-stage tasks, and Group Relative Policy Optimization (GRPO) relies on sparse outcome rewards, which limits credit assignment across intermediate steps. Existing remedies, such as full rollouts, external LLM judges, or intrinsic rewards that require ground-truth answers, are costly or impractical. The authors hypothesize that hidden-state probes can address these limitations, but show that existing probing research assumes clean inputs. That assumption fails in multi-step settings because of prefix contamination: the probes end up tracking coherence with possibly corrupt prefixes.
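To make the credit-assignment problem concrete, here is a minimal sketch of the group-relative advantage commonly used in GRPO, assuming one scalar outcome reward per rollout. The function names and toy rewards are illustrative assumptions and are not taken from the paper.

```python
from statistics import mean, pstdev


def group_relative_advantages(outcome_rewards, eps=1e-8):
    """Normalize each rollout's outcome reward against its sampling group."""
    mu = mean(outcome_rewards)
    sigma = pstdev(outcome_rewards)
    return [(r - mu) / (sigma + eps) for r in outcome_rewards]


def broadcast_to_steps(advantage, num_steps):
    """With outcome-only rewards, every step of a rollout shares one advantage."""
    return [advantage] * num_steps


# Four hypothetical rollouts of the same task; only the final outcome is scored.
rewards = [1.0, 0.0, 0.0, 1.0]
steps_per_rollout = [3, 5, 2, 4]

advantages = group_relative_advantages(rewards)
step_advantages = [
    broadcast_to_steps(a, n) for a, n in zip(advantages, steps_per_rollout)
]
```

Because each rollout receives a single number, a good intermediate step inside a failed rollout is penalized just like a bad one. A step-level reward signal is meant to fill that gap.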

Key facts

Paper arXiv:2605.17877 introduces PAIR method
PAIR repurposes internal correctness probing over LLM hidden states as step-level reward signal
Current LLMs struggle with complex multi-stage tasks
GRPO relies on sparse outcome rewards limiting credit assignment
Existing remedies like full rollouts, external LLM judges, or intrinsic rewards that require ground-truth answers are costly or impractical
Hidden-state probes degrade under prefix contamination in multi-step settings
Existing probing research assumes clean inputs, which breaks down in multi-step settings
PAIR addresses prefix contamination, in which probes track coherence with possibly corrupt prefixes
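As a rough illustration of using a hidden-state probe as a step-level reward, the sketch below scores each step with a generic linear probe (a logistic layer over one hidden-state vector). This is not PAIR's method: the weights, the two-dimensional hidden states, and the function names are all hypothetical, and a plain probe like this one has no protection against prefix contamination.

```python
import math


def probe_score(hidden_state, weights, bias=0.0):
    """Logistic probe: maps one hidden-state vector to a score in (0, 1)."""
    z = sum(h * w for h, w in zip(hidden_state, weights)) + bias
    return 1.0 / (1.0 + math.exp(-z))


def step_rewards(step_hidden_states, weights, bias=0.0):
    """One probe score per step, used as a dense step-level reward."""
    return [probe_score(h, weights, bias) for h in step_hidden_states]


# Toy values (assumptions): a real probe would be trained on labeled
# hidden states taken from the model at the end of each step.
probe_weights = [1.0, -1.0]
hidden_per_step = [[2.0, 0.0], [0.0, 2.0], [1.0, 1.0]]

rewards = step_rewards(hidden_per_step, probe_weights)
```

Each step now gets its own score instead of sharing a single outcome reward. Whether that score reflects correctness or merely coherence with an earlier, possibly corrupt step is exactly the issue the paper raises.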
