LongTraceRL: Learning Long-Context Reasoning from Search Agent Trajectories with Rubric Rewards

other · 2026-06-01

A new reinforcement learning method, LongTraceRL, addresses long-context reasoning in large language models by using search agent trajectories to create tiered distractors and rubric-based rewards. The approach generates multi-hop questions via knowledge graph random walks and leverages search agent trajectories to build high-confusability distractors from documents read but not cited, and low-confusability distractors from unopened search results. This produces more challenging training contexts than random sampling or one-shot search. The rubric reward provides intermediate supervision for reasoning steps, overcoming the limitations of sparse outcome-only rewards. The paper is available on arXiv under ID 2605.31584.
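The tiering described above can be sketched in a few lines of Python. This is a minimal illustration, not the paper's implementation: the Trajectory record, its field names, the tier_documents and build_context helpers, and the choice to fill the context with high-confusability documents before low-confusability ones are all assumptions made for the example.

```python
import random
from dataclasses import dataclass


@dataclass
class Trajectory:
    """Hypothetical record of one search-agent run (field names are assumptions)."""
    retrieved: list  # doc ids returned by any search call
    opened: list     # doc ids the agent actually read
    cited: list      # doc ids cited in the final answer


def tier_documents(traj):
    """Split a trajectory's documents into evidence and two distractor tiers."""
    cited, opened = set(traj.cited), set(traj.opened)
    # dict.fromkeys removes duplicates while keeping first-seen order.
    evidence = list(dict.fromkeys(traj.cited))
    # Read but not cited: relevant enough to open, so easy to confuse with evidence.
    high = [d for d in dict.fromkeys(traj.opened) if d not in cited]
    # Returned by search but never opened: weaker, low-confusability distractors.
    low = [d for d in dict.fromkeys(traj.retrieved)
           if d not in opened and d not in cited]
    return {"evidence": evidence, "high": high, "low": low}


def build_context(tiers, budget, rng):
    """Keep all evidence, then fill up to `budget` docs (assumed order: high, then low)."""
    docs = list(tiers["evidence"])
    for tier in ("high", "low"):
        for doc in tiers[tier]:
            if len(docs) >= budget:
                break
            docs.append(doc)
    rng.shuffle(docs)  # hide the tier order from the model
    return docs


traj = Trajectory(retrieved=["a", "b", "c", "d", "e"],
                  opened=["a", "b", "c"],
                  cited=["a"])
tiers = tier_documents(traj)
context = build_context(tiers, budget=4, rng=random.Random(0))
```

Here document "a" is evidence, "b" and "c" are high-confusability distractors, and "d" and "e" are low-confusability distractors. A real pipeline would budget by tokens rather than by document count.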

Key facts

LongTraceRL is a reinforcement learning method for long-context reasoning.
It uses search agent trajectories to build tiered distractors.
High-confusability distractors come from documents read but not cited.
Low-confusability distractors come from unopened search results.
Multi-hop questions are generated via knowledge graph random walks.
Rubric rewards supervise intermediate reasoning steps.
The method addresses two limitations of RLVR (reinforcement learning with verifiable rewards) in long-context training: low-confusability distractors and sparse outcome-only rewards.
The paper is available on arXiv with ID 2605.31584.
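
To illustrate how a rubric reward differs from an outcome-only reward, the sketch below scores a response by the fraction of rubric items it satisfies and adds that to a binary correctness term. Everything here is an assumption for illustration: the summary does not say how rubric items are checked or how the two terms are combined, so substring_judge, rubric_weight, and the additive form are hypothetical stand-ins (a real system would more plausibly use a model-based judge).

```python
def substring_judge(response, criterion):
    """Placeholder judge (assumption): a rubric item counts if it appears verbatim."""
    return criterion.lower() in response.lower()


def rubric_score(response, rubric, judge=substring_judge):
    """Fraction of rubric items (intermediate reasoning steps) the response satisfies."""
    if not rubric:
        return 0.0
    return sum(judge(response, item) for item in rubric) / len(rubric)


def total_reward(response, final_answer, gold_answer, rubric,
                 rubric_weight=0.5, judge=substring_judge):
    """Outcome reward plus weighted rubric reward (additive form is an assumption)."""
    correct = final_answer.strip().lower() == gold_answer.strip().lower()
    outcome = 1.0 if correct else 0.0
    return outcome + rubric_weight * rubric_score(response, rubric, judge)


rubric = ["Marie Curie", "Warsaw"]  # one item per hop of a two-hop question

partial = total_reward(
    response="The discoverer was Marie Curie, who was born in Paris.",
    final_answer="Paris", gold_answer="Warsaw", rubric=rubric)

full = total_reward(
    response="The discoverer was Marie Curie, who was born in Warsaw.",
    final_answer="Warsaw", gold_answer="Warsaw", rubric=rubric)
```

With an outcome-only reward, the first response would score zero even though it resolved the first hop correctly. The rubric term gives it partial credit, which is the intermediate supervision the summary describes.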
