PAIR: Prefix-Aware Internal Reward Model for Multi-Turn Agent Optimization
A new paper on arXiv (2605.17877) introduces PAIR, a method that repurposes internal correctness probing over LLM hidden states as a step-level reward signal for multi-turn agent optimization. Current LLMs struggle with complex multi-stage tasks, and Group Relative Policy Optimization (GRPO) relies on sparse outcome rewards that limit credit assignment across intermediate steps. Existing remedies such as full rollouts, external LLM judges, or intrinsic rewards that rely on ground-truth answers are costly or impractical. The authors hypothesize that hidden-state probes can address these limitations, but show that existing probing research assumes clean inputs, an assumption that breaks down in multi-step settings: under prefix contamination, probes degrade and end up tracking coherence with possibly corrupt prefixes.
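To make the credit-assignment problem concrete, here is a minimal sketch of group-relative advantages computed from outcome rewards alone. It illustrates the general GRPO idea, not code from the paper; the function names, the toy rewards, and the step counts are assumptions made for illustration.

```python
from statistics import mean, pstdev


def group_relative_advantages(outcome_rewards, eps=1e-8):
    """Normalize each trajectory's outcome reward against its sampling group."""
    mu = mean(outcome_rewards)
    sigma = pstdev(outcome_rewards)
    # eps guards against division by zero when all rewards are identical.
    return [(r - mu) / (sigma + eps) for r in outcome_rewards]


def per_step_advantages(outcome_rewards, steps_per_trajectory):
    """Copy each trajectory-level advantage onto every step of that trajectory."""
    advantages = group_relative_advantages(outcome_rewards)
    return [[a] * n for a, n in zip(advantages, steps_per_trajectory)]


# Hypothetical group of four rollouts: 1.0 = task solved, 0.0 = task failed.
rewards = [1.0, 0.0, 0.0, 1.0]
steps = [3, 2, 4, 3]  # number of agent turns in each rollout (illustrative)
step_adv = per_step_advantages(rewards, steps)

for trajectory in step_adv:
    print(trajectory)
```

Every step within a rollout receives the same advantage, so a good intermediate step in a failed rollout is penalized exactly as much as the step that caused the failure. This is the gap a step-level reward signal is meant to close.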
Key facts
- Paper arXiv:2605.17877 introduces PAIR method
- PAIR repurposes internal correctness probing over LLM hidden states as step-level reward signal
- Current LLMs struggle with complex multi-stage tasks
- GRPO relies on sparse outcome rewards limiting credit assignment
- Existing remedies like full rollouts, external LLM judges, or intrinsic rewards that rely on ground-truth answers are costly or impractical
- Hidden-state probes degrade under prefix contamination in multi-step settings
- Existing probing research assumes clean inputs, which breaks down in multi-step settings
- PAIR addresses prefix contamination, in which probes track coherence with possibly corrupt prefixes
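The generic idea of reading a step-level score off hidden states can be sketched as follows. This is a plain logistic probe under stated assumptions, not PAIR's prefix-aware design, which this summary does not detail; the weights, bias, and two-dimensional hidden states are toy values invented for illustration.

```python
import math


def probe_score(hidden_state, weights, bias):
    """Logistic probe: map one hidden-state vector to a score in (0, 1)."""
    logit = sum(h * w for h, w in zip(hidden_state, weights)) + bias
    return 1.0 / (1.0 + math.exp(-logit))


def step_rewards(step_hidden_states, weights, bias):
    """Score the hidden state captured at the end of each agent step."""
    return [probe_score(h, weights, bias) for h in step_hidden_states]


# Toy probe parameters and hidden states (hypothetical; a real probe would be
# trained on labeled hidden states from the policy model).
weights = [1.0, -1.0]
bias = 0.0
hidden_states = [[2.0, 0.0], [0.0, 2.0]]

probe_rewards = step_rewards(hidden_states, weights, bias)
print(probe_rewards)
```

A probe of this kind scores each step independently of whether the preceding steps were sound, which is the clean-input assumption the paper reports as failing in multi-step settings.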
Entities
Institutions
- arXiv