T$^2$PO: Uncertainty-Guided Exploration Control for Stable Multi-Turn Agentic Reinforcement Learning
A new paper on arXiv (2605.02178) introduces Token- and Turn-level Policy Optimization (T$^2$PO), an uncertainty-aware framework for stabilizing multi-turn reinforcement learning in reasoning LLMs. The authors argue that instability in multi-turn RL often stems from inefficient exploration: policies generate low-information actions that neither reduce uncertainty nor advance the task. T$^2$PO addresses this by controlling exploration at two fine-grained levels. At the token level, it monitors uncertainty dynamics and triggers a thinking intervention when the marginal change in uncertainty falls below a threshold; at the turn level, it identifies interactions that make negligible exploration progress. The goal is to improve training stability and prevent collapse in complex interactive tasks.
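To make the token-level mechanism concrete, here is a minimal sketch of threshold-triggered uncertainty monitoring. The paper's exact formulation is not reproduced here; the entropy-based uncertainty measure, the trailing-window comparison, and all identifiers (`token_entropy`, `should_intervene`, `delta_threshold`) are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def token_entropy(logits: torch.Tensor) -> torch.Tensor:
    """Shannon entropy (in nats) of the next-token distribution at each
    generation step; logits has shape (seq_len, vocab_size)."""
    log_probs = F.log_softmax(logits, dim=-1)
    return -(log_probs.exp() * log_probs).sum(dim=-1)

def should_intervene(entropies: torch.Tensor, window: int = 8,
                     delta_threshold: float = 0.05) -> bool:
    """Fire a thinking intervention when the marginal change in
    uncertainty between the two most recent windows falls below
    delta_threshold, i.e. generation is no longer moving uncertainty."""
    if entropies.numel() < 2 * window:
        return False  # not enough history to estimate a trend
    recent = entropies[-window:].mean()
    previous = entropies[-2 * window:-window].mean()
    return (previous - recent).abs().item() < delta_threshold
```

In a rollout loop, a check like `should_intervene` would run after each decoded token; on trigger, the framework would inject a reflection prompt (the paper's "thinking intervention") before resuming generation.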
Key facts
- Paper is on arXiv with ID 2605.02178
- Proposes T$^2$PO (Token- and Turn-level Policy Optimization)
- Addresses instability in multi-turn reinforcement learning
- Instability attributed to inefficient exploration
- Token-level monitoring of uncertainty dynamics
- Turn-level identification of interactions with negligible exploration progress (see the sketch after this list)
- Aims to prevent training collapse
- Focuses on reasoning LLMs
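As referenced above, a hedged sketch of the turn-level idea: score each turn's exploration progress and mask low-progress turns out of the policy update. The progress proxy (entropy drop across the turn), the masking scheme, and every identifier here are assumptions for illustration; the paper states only that such turns are identified.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Turn:
    """One agent-environment interaction; entropies are the policy's
    mean token entropy measured before and after the turn."""
    mean_entropy_before: float
    mean_entropy_after: float

def exploration_progress(turn: Turn) -> float:
    # Proxy: how much this turn reduced the policy's uncertainty.
    return turn.mean_entropy_before - turn.mean_entropy_after

def turn_mask(turns: List[Turn], min_progress: float = 0.02) -> List[float]:
    """Per-turn weights for the policy-gradient loss: turns with
    negligible exploration progress are masked out (weight 0.0)."""
    return [1.0 if exploration_progress(t) >= min_progress else 0.0
            for t in turns]

# Example: the second turn barely changed uncertainty and is masked.
trajectory = [Turn(2.10, 1.75), Turn(1.75, 1.74), Turn(1.74, 1.30)]
print(turn_mask(trajectory))  # [1.0, 0.0, 1.0]
```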
Entities
- arXiv (preprint repository hosting the paper)