ARTFEED — Contemporary Art Intelligence

SORT: Selective Off-Policy Reference Tuning Improves Reasoning in LLMs

ai-technology · 2026-05-13

Researchers have unveiled Selective Off-Policy Reference Tuning (SORT), a technique that improves reinforcement learning with verifiable rewards for large language models. SORT addresses a key failure mode of GRPO-style methods: on hard prompts where every sampled rollout fails, all rollouts receive the same reward, the group-relative advantage collapses to zero, and the model gets no learning signal. Rather than altering rollout generation, SORT adds a repair update for these prompts: it derives a plan from the reference solution, compares token probabilities with and without that plan in context, and upweights tokens that become more predictable when conditioned on the plan. This turns all-fail prompts into a selective, structure-aware learning signal instead of blunt imitation of the reference. SORT improves over GRPO and guidance baselines across three backbones and eight reasoning benchmarks, with the largest gains on weaker models.
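As a rough illustration, here is a minimal sketch of what such a repair update could look like in PyTorch with a Hugging Face-style causal LM. The function names, the thresholded ReLU weighting, and the choice to compute the training loss without the plan in context are assumptions for illustration, not the paper's exact formulation.

    import torch
    import torch.nn.functional as F

    def token_logprobs(model, context_ids, target_ids):
        # Log-probability of each target token given the context prefix.
        # context_ids and target_ids are 1-D LongTensors of token ids.
        input_ids = torch.cat([context_ids, target_ids]).unsqueeze(0)
        logits = model(input_ids).logits[0]
        # Logits at position i predict token i + 1, so the target span is
        # predicted by positions len(context) - 1 .. end - 1.
        start = context_ids.size(0) - 1
        preds = logits[start : start + target_ids.size(0)]
        logp = F.log_softmax(preds, dim=-1)
        return logp.gather(1, target_ids.unsqueeze(1)).squeeze(1)

    def sort_repair_loss(model, prompt_ids, plan_ids, ref_ids, tau=0.0):
        # Score each reference token with and without the plan in context.
        with torch.no_grad():
            lp_plan = token_logprobs(
                model, torch.cat([prompt_ids, plan_ids]), ref_ids
            )
            lp_base = token_logprobs(model, prompt_ids, ref_ids)
            gain = lp_plan - lp_base         # predictability gain under the plan
            w = torch.relu(gain - tau)       # keep tokens the plan helps explain
            w = w / w.sum().clamp_min(1e-8)  # normalize weights over the sequence
        # Selective off-policy update: weighted NLL on the reference tokens.
        nll = -token_logprobs(model, prompt_ids, ref_ids)
        return (w * nll).sum()

Under this reading, tokens the plan does not make more predictable get zero weight, so the update imitates only the parts of the reference solution that the plan structurally explains, rather than cloning the whole reference.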

Key facts

  • SORT stands for Selective Off-Policy Reference Tuning.
  • It is designed for reinforcement learning with verifiable rewards.
  • GRPO-style methods stall on hard prompts where all sampled rollouts fail.
  • SORT adds a repair update for those failures without changing rollout generation.
  • It derives a plan from the reference solution.
  • It compares token probabilities with and without that plan.
  • Higher weight is given to tokens that become more predictable under plan conditioning (see the formula sketch after this list).
  • SORT was tested across three backbones and eight reasoning benchmarks.
  • It improves over GRPO and guidance baselines.
  • Largest gains were on weaker models.
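
The plan-conditioned weighting in the facts above can be written compactly. The notation below is an assumption for illustration, not taken from the paper: x is the prompt, z the plan derived from the reference solution, and y_t the t-th token of the reference solution y.

    % Hypothetical formalization of the selective weighting; notation assumed.
    \[
      g_t = \log \pi_\theta\left(y_t \mid x, z, y_{<t}\right)
          - \log \pi_\theta\left(y_t \mid x, y_{<t}\right),
      \qquad
      w_t \propto \max(0,\, g_t)
    \]
    \[
      \mathcal{L}_{\text{repair}}(\theta)
      = -\sum_t w_t \,\log \pi_\theta\left(y_t \mid x, y_{<t}\right)
    \]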
