StepOPSD: Step-Aware Online Preference Distillation for Agent RL
StepOPSD is a post-rollout preference self-distillation framework for multi-turn agent reinforcement learning. It targets credit-assignment mismatch in three stages: it decomposes trajectories into action-centered step segments, rescores each step under hindsight-enriched teacher contexts, and converts the resulting token-level log-probability gaps into sign-preserving advantage shaping, with a normalized per-step credit budget applied before the GRPO update. On ALFWorld and Search-QA with Qwen3-1.7B and Qwen2.5-3B-Instruct, StepOPSD achieves the best or second-best results on the subsets most sensitive to local decisions.
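The advantage-shaping idea can be illustrated with a minimal sketch. This is not the paper's formula: the function name `shape_advantages`, the `tanh` squashing, and the strength parameter `alpha` are all assumptions chosen for illustration. The sketch only demonstrates the two properties named in the summary: every token's shaped advantage keeps the sign of the trajectory advantage, and each step's total credit is normalized to a fixed budget.

```python
import math


def shape_advantages(advantage, gaps, step_slices, alpha=0.5):
    """Spread one trajectory-level advantage over tokens, step by step.

    advantage   : scalar trajectory advantage (e.g. from GRPO group normalization)
    gaps        : per-token teacher-minus-student log-probability gaps
    step_slices : (start, end) token index pairs, one per step segment
    alpha       : hypothetical shaping strength, must lie in [0, 1)
    """
    if not 0.0 <= alpha < 1.0:
        raise ValueError("alpha must be in [0, 1) so that all weights stay positive")

    # Tokens outside every step segment (e.g. observations) receive no credit.
    shaped = [0.0] * len(gaps)
    direction = 1.0 if advantage >= 0 else -1.0

    for start, end in step_slices:
        if end <= start:
            continue  # skip empty segments
        # tanh bounds each gap to (-1, 1); with alpha < 1 every weight is > 0,
        # so multiplying by the advantage can never flip its sign.
        # For a positive advantage, teacher-preferred tokens get more credit;
        # for a negative one, teacher-dispreferred tokens get more blame.
        raw = [1.0 + alpha * direction * math.tanh(g) for g in gaps[start:end]]
        # Per-step budget: rescale so the weights in this step average to 1,
        # which makes the step's total credit equal advantage * step_length.
        scale = len(raw) / sum(raw)
        for offset, weight in enumerate(raw):
            shaped[start + offset] = advantage * weight * scale

    return shaped


if __name__ == "__main__":
    gaps = [0.0, 1.0, -1.0, 0.5]
    steps = [(0, 2), (2, 4)]
    print(shape_advantages(2.0, gaps, steps))
    print(shape_advantages(-1.0, gaps, steps))
```

In this sketch the shaping only redistributes credit inside a step; it never changes how much credit the step receives overall, so a step with large log-probability gaps cannot dominate the update merely because the teacher disagrees with it strongly.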
Key facts
- StepOPSD is a post-rollout preference self-distillation framework for multi-turn agent reinforcement learning.
- It addresses credit-assignment mismatch by decomposing trajectories into action-centered step segments.
- It rescores steps under hindsight-enriched teacher contexts.
- It converts token-level log-probability gaps into sign-preserving advantage shaping.
- It applies a normalized per-step credit budget before the GRPO update.
- Tested on ALFWorld and Search-QA with Qwen3-1.7B and Qwen2.5-3B-Instruct.
- Achieves best or second-best results on subsets most sensitive to local decisions.
- Published on arXiv with ID 2605.27140.
Entities
Institutions
- arXiv