StepOPSD: Step-Aware Online Preference Distillation for Agent RL

other · 2026-05-27

StepOPSD is a post-rollout self-distillation framework for multi-turn agent reinforcement learning. It targets credit-assignment mismatch by decomposing each trajectory into action-centered step segments, rescoring those steps under hindsight-enriched teacher contexts, and converting the resulting token-level log-probability gaps into sign-preserving advantage shaping, with a normalized per-step credit budget applied before the GRPO update. Evaluated on ALFWorld and Search-QA with Qwen3-1.7B and Qwen2.5-3B-Instruct, StepOPSD achieves the best or second-best results on the subsets most sensitive to local decisions.

Key facts

StepOPSD is a post-rollout preference self-distillation framework for multi-turn agent reinforcement learning.
It addresses credit-assignment mismatch by decomposing trajectories into action-centered step segments.
It rescores steps under hindsight-enriched teacher contexts.
It converts token-level log-probability gaps into sign-preserving advantage shaping.
It applies a normalized per-step credit budget before the GRPO update.
It was tested on ALFWorld and Search-QA with Qwen3-1.7B and Qwen2.5-3B-Instruct.
It achieves the best or second-best results on the subsets most sensitive to local decisions.
Published on arXiv with ID 2605.27140.
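
The shaping step listed above can be illustrated with a minimal Python sketch. The summary does not give the paper's formulas, so everything here is an assumption made for illustration: the function names (step_gaps, shape_advantages), the softmax weighting, and the temperature parameter are hypothetical and are not taken from StepOPSD itself. The sketch only shows one way to turn per-step teacher-minus-student log-probability gaps into step weights that keep the sign of the trajectory advantage and keep the total credit fixed.

```python
import math
from typing import List, Sequence


def step_gaps(teacher_logps: Sequence[Sequence[float]],
              student_logps: Sequence[Sequence[float]]) -> List[float]:
    """Mean token-level log-prob gap (teacher minus student) for each step.

    Each inner sequence holds the log-probabilities of one step's action
    tokens. The teacher is assumed to score the same tokens under a
    hindsight-enriched context (illustrative assumption).
    """
    gaps = []
    for t_step, s_step in zip(teacher_logps, student_logps):
        if len(t_step) != len(s_step) or not t_step:
            raise ValueError("each step needs matching, non-empty token lists")
        gaps.append(sum(t - s for t, s in zip(t_step, s_step)) / len(t_step))
    return gaps


def shape_advantages(advantage: float, gaps: Sequence[float],
                     temperature: float = 1.0) -> List[float]:
    """Redistribute one trajectory-level advantage across its steps.

    Weights are strictly positive, so every shaped value keeps the sign of
    `advantage`. Weights sum to the number of steps, so the total credit
    equals what uniform per-step assignment would give (the "budget").
    """
    if not gaps:
        return []
    # For a positive advantage, reward steps the teacher prefers more;
    # for a negative advantage, blame steps the teacher prefers less.
    sign = 1.0 if advantage >= 0 else -1.0
    scores = [sign * g / temperature for g in gaps]
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    k = len(gaps)
    weights = [k * e / total for e in exps]  # positive, mean weight of 1
    return [advantage * w for w in weights]


# Toy trajectory with three steps (made-up numbers).
teacher = [[-0.5, -0.5], [-2.0, -2.0], [-1.0]]
student = [[-1.5, -1.5], [-1.0, -1.0], [-1.0]]
gaps = step_gaps(teacher, student)
shaped = shape_advantages(2.0, gaps)
```

In this toy example the first step, where the hindsight teacher is more confident than the student, receives the largest share of a positive advantage, and the second step receives the smallest. With a negative advantage the ordering flips, so the second step takes most of the blame. The shaped values would then stand in for the uniform per-token advantage in a GRPO-style update.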
