T$^2$PO: Uncertainty-Guided Exploration Control for Stable Multi-Turn Agentic Reinforcement Learning
A new paper on arXiv (2605.02178) introduces Token- and Turn-level Policy Optimization (T$^2$PO), an uncertainty-aware framework for stabilizing multi-turn reinforcement learning in reasoning LLMs. The authors argue that instability in multi-turn RL often stems from inefficient exploration: policies generate low-information actions that neither reduce uncertainty nor advance the task. T$^2$PO addresses this by controlling exploration at two fine-grained levels. At the token level, it monitors uncertainty dynamics and triggers a thinking intervention when the marginal change in uncertainty falls below a threshold; at the turn level, it identifies interactions that make negligible exploration progress. The goal is to improve training stability and prevent collapse in complex interactive tasks.
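To make the token-level mechanism concrete, here is a minimal sketch of threshold-triggered uncertainty monitoring. The paper's exact formulation is not reproduced here; the entropy-based uncertainty measure, the trailing-window comparison, and all identifiers (`token_entropy`, `should_intervene`, `delta_threshold`) are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def token_entropy(logits: torch.Tensor) -> torch.Tensor:
    """Shannon entropy (in nats) of the next-token distribution at each
    generation step; logits has shape (seq_len, vocab_size)."""
    log_probs = F.log_softmax(logits, dim=-1)
    return -(log_probs.exp() * log_probs).sum(dim=-1)

def should_intervene(entropies: torch.Tensor, window: int = 8,
                     delta_threshold: float = 0.05) -> bool:
    """Fire a thinking intervention when the marginal change in
    uncertainty between the two most recent windows falls below
    delta_threshold, i.e. generation is no longer moving uncertainty."""
    if entropies.numel() < 2 * window:
        return False  # not enough history to estimate a trend
    recent = entropies[-window:].mean()
    previous = entropies[-2 * window:-window].mean()
    return (previous - recent).abs().item() < delta_threshold
```

In a rollout loop, a check like `should_intervene` would run after each decoded token; on trigger, the framework would inject a reflection prompt (the paper's "thinking intervention") before resuming generation.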
Key facts
- Paper is on arXiv with ID 2605.02178
- Proposes T$^2$PO (Token- and Turn-level Policy Optimization)
- Addresses instability in multi-turn reinforcement learning
- Instability attributed to inefficient exploration
- Token-level monitoring of uncertainty dynamics
- Turn-level identification of interactions with negligible exploration progress (see the sketch after this list)
- Aims to prevent training collapse
- Focuses on reasoning LLMs
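As referenced above, a hedged sketch of the turn-level idea: score each turn's exploration progress and mask low-progress turns out of the policy update. The progress proxy (entropy drop across the turn), the masking scheme, and every identifier here are assumptions for illustration; the paper states only that such turns are identified.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Turn:
    """One agent-environment interaction; entropies are the policy's
    mean token entropy measured before and after the turn."""
    mean_entropy_before: float
    mean_entropy_after: float

def exploration_progress(turn: Turn) -> float:
    # Proxy: how much this turn reduced the policy's uncertainty.
    return turn.mean_entropy_before - turn.mean_entropy_after

def turn_mask(turns: List[Turn], min_progress: float = 0.02) -> List[float]:
    """Per-turn weights for the policy-gradient loss: turns with
    negligible exploration progress are masked out (weight 0.0)."""
    return [1.0 if exploration_progress(t) >= min_progress else 0.0
            for t in turns]

# Example: the second turn barely changed uncertainty and is masked.
trajectory = [Turn(2.10, 1.75), Turn(1.75, 1.74), Turn(1.74, 1.30)]
print(turn_mask(trajectory))  # [1.0, 0.0, 1.0]
```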
Entities
- arXiv (preprint repository hosting the paper)