ARTFEED — Contemporary Art Intelligence

LongTraceRL: Learning Long-Context Reasoning from Search Agent Trajectories with Rubric Rewards

other · 2026-06-01

A new reinforcement learning method, LongTraceRL, targets long-context reasoning in large language models by using search agent trajectories to construct tiered distractors and rubric-based rewards. Multi-hop questions are generated by random walks over a knowledge graph. Search agent trajectories then supply the distractors: documents the agent read but did not cite become high-confusability distractors, and search results it never opened become low-confusability distractors. The resulting training contexts are more challenging than those built by random sampling or one-shot search. The rubric reward supervises intermediate reasoning steps, addressing the limitations of sparse outcome-only rewards. The paper is available on arXiv under ID 2605.31584.
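The tiering rule described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the paper's implementation: the `Trajectory` fields, the `tier_documents` and `build_context` helpers, and the choice to fill the context with high-confusability documents before low-confusability ones are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Trajectory:
    # Hypothetical record of one search agent run (field names are assumptions).
    search_results: list[str]  # doc ids returned by the agent's searches, in order
    opened: set[str]           # doc ids the agent actually read
    cited: set[str]            # doc ids cited in the agent's final answer


def tier_documents(traj: Trajectory) -> dict[str, list[str]]:
    """Split retrieved documents into evidence and two distractor tiers."""
    seen = list(dict.fromkeys(traj.search_results))  # dedupe, keep first-seen order
    return {
        # Cited documents are treated as supporting evidence.
        "gold": [d for d in seen if d in traj.cited],
        # Read but not cited: high-confusability distractors.
        "high": [d for d in seen if d in traj.opened and d not in traj.cited],
        # Never opened: low-confusability distractors.
        "low": [d for d in seen if d not in traj.opened],
    }


def build_context(traj: Trajectory, budget: int) -> list[str]:
    """Keep all evidence, then fill up to `budget` docs, hardest tier first.

    The fill order is an assumption made for this sketch.
    """
    tiers = tier_documents(traj)
    context = list(tiers["gold"])
    for doc in tiers["high"] + tiers["low"]:
        if len(context) >= budget:
            break
        context.append(doc)
    return context


traj = Trajectory(
    search_results=["a", "b", "c", "d", "b"],
    opened={"a", "b"},
    cited={"a"},
)
tiers = tier_documents(traj)
context = build_context(traj, budget=3)
```

In this toy run, `a` is evidence, `b` is a high-confusability distractor, and `c` and `d` are low-confusability distractors. A real pipeline would also shuffle document order and measure the budget in tokens, which the sketch leaves out.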

Key facts

  • LongTraceRL is a reinforcement learning method for long-context reasoning.
  • It uses search agent trajectories to build tiered distractors.
  • High-confusability distractors come from documents read but not cited.
  • Low-confusability distractors come from unopened search results.
  • Multi-hop questions are generated via knowledge graph random walks.
  • Rubric rewards supervise intermediate reasoning steps.
  • The method targets two limitations of existing RLVR (reinforcement learning with verifiable rewards) training: contexts built from low-confusability distractors, and sparse outcome-only rewards.
  • The paper is available on arXiv with ID 2605.31584.
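To make the contrast between an outcome-only reward and a rubric reward concrete, here is a minimal sketch. It assumes the rubric is a list of key facts the reasoning should mention, that substring matching stands in for whatever grader the paper actually uses, and that the two signals are blended with a fixed weight. The function name `rubric_reward` and the `rubric_weight` parameter are hypothetical.

```python
def rubric_reward(
    reasoning: str,
    answer: str,
    gold_answer: str,
    rubric: list[str],
    rubric_weight: float = 0.5,  # hypothetical blending weight
) -> float:
    """Blend a sparse outcome reward with a dense rubric score."""
    # Outcome reward: 1.0 only when the final answer matches exactly.
    outcome = 1.0 if answer.strip().lower() == gold_answer.strip().lower() else 0.0
    if not rubric:
        return outcome

    # Rubric score: fraction of rubric items found in the reasoning text.
    # Substring matching is a stand-in for a real grader.
    text = reasoning.lower()
    hits = sum(1 for item in rubric if item.lower() in text)
    process = hits / len(rubric)

    return (1.0 - rubric_weight) * outcome + rubric_weight * process


rubric = ["founder of org x", "moved to city y"]
partial = rubric_reward(
    reasoning="Document 3 names the founder of Org X, but her later city is unclear.",
    answer="City Z",
    gold_answer="City Y",
    rubric=rubric,
)
# partial == 0.25: the answer is wrong, but one of two rubric items was covered.
```

With an outcome-only reward, this rollout would score 0.0 and be indistinguishable from one that found nothing. The rubric term gives partial credit for the first hop, which is the kind of intermediate supervision the summary describes.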

Entities

Institutions

  • arXiv
