LongTraceRL: Learning Long-Context Reasoning from Search Agent Trajectories with Rubric Rewards
A new reinforcement learning method, LongTraceRL, addresses long-context reasoning in large language models by using search agent trajectories to create tiered distractors and rubric-based rewards. The approach generates multi-hop questions via knowledge graph random walks and leverages search agent trajectories to build high-confusability distractors from documents read but not cited, and low-confusability distractors from unopened search results. This produces more challenging training contexts than random sampling or one-shot search. The rubric reward provides intermediate supervision for reasoning steps, overcoming the limitations of sparse outcome-only rewards. The paper is available on arXiv under ID 2605.31584.
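The tiering rule described above can be sketched in a few lines. This is only an illustration, not the paper's implementation: the `Trajectory` record, its field names, and the `tier_documents` helper are all hypothetical, and it assumes the agent's log exposes which results were retrieved, which were opened, and which were cited.

```python
from dataclasses import dataclass, field


@dataclass
class Trajectory:
    """Hypothetical record of one search-agent run (field names are assumptions)."""
    retrieved: list[str]                            # doc ids returned by any search call
    opened: set[str] = field(default_factory=set)   # doc ids the agent actually read
    cited: set[str] = field(default_factory=set)    # doc ids cited in the final answer


def tier_documents(traj: Trajectory) -> dict[str, list[str]]:
    """Split retrieved documents into evidence and two distractor tiers."""
    tiers = {"evidence": [], "high_confusability": [], "low_confusability": []}
    seen = set()
    for doc_id in traj.retrieved:
        if doc_id in seen:  # the same doc can be returned by several searches
            continue
        seen.add(doc_id)
        if doc_id in traj.cited:
            tiers["evidence"].append(doc_id)
        elif doc_id in traj.opened:
            tiers["high_confusability"].append(doc_id)  # read but not cited
        else:
            tiers["low_confusability"].append(doc_id)   # never opened
    return tiers


# Toy trajectory: five distinct results, three opened, one cited.
traj = Trajectory(
    retrieved=["d1", "d2", "d3", "d4", "d2", "d5"],
    opened={"d1", "d2", "d4"},
    cited={"d1"},
)
tiers = tier_documents(traj)
print(tiers)
```

In this toy run, `d2` and `d4` land in the high-confusability tier because the agent read them without citing them, while `d3` and `d5` are low-confusability because they were never opened.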
Key facts
- LongTraceRL is a reinforcement learning method for long-context reasoning.
- It uses search agent trajectories to build tiered distractors.
- High-confusability distractors come from documents read but not cited.
- Low-confusability distractors come from unopened search results.
- Multi-hop questions are generated via knowledge graph random walks.
- Rubric rewards supervise intermediate reasoning steps.
- It targets two limitations of reinforcement learning with verifiable rewards (RLVR): training contexts built from low-confusability distractors, and sparse outcome-only rewards.
- The paper is available on arXiv with ID 2605.31584.
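To make the contrast between a rubric reward and an outcome-only reward concrete, here is a minimal sketch. It rests on assumptions that do not come from the paper: the `rubric_reward` function is hypothetical, substring matching stands in for whatever judge scores the rubric, and the 0.5 weighting is arbitrary.

```python
def rubric_reward(response: str, answer: str, rubric: list[str],
                  outcome_weight: float = 0.5) -> float:
    """Blend final-answer correctness with the fraction of rubric items covered.

    Assumptions (not from the paper): substring matching stands in for the
    rubric judge, and the default 0.5 weighting is arbitrary.
    """
    text = response.lower()
    outcome = 1.0 if answer.lower() in text else 0.0
    if rubric:
        covered = sum(item.lower() in text for item in rubric) / len(rubric)
    else:
        covered = 0.0
    return outcome_weight * outcome + (1.0 - outcome_weight) * covered


# Toy two-hop question: one rubric item per reasoning hop.
rubric = ["alpha founded beta", "beta is based in gamma"]
full = "Alpha founded Beta, and Beta is based in Gamma, so the answer is Gamma."
partial = "Alpha founded Beta, so the answer is Delta."

full_score = rubric_reward(full, "Gamma", rubric)
partial_score = rubric_reward(partial, "Gamma", rubric)
print(full_score, partial_score)
```

An outcome-only reward would score the second response 0. The rubric term still gives it partial credit for the first hop, which is the kind of intermediate signal the key facts describe.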
Entities
Institutions
- arXiv