Plan: Structured Agentic Behavior for Multi-Hop Retrieval
A recent arXiv paper (2605.28354) presents Plan, a method for structured agentic behavior in multi-hop retrieval. Plan decomposes a question into an ordered sequence of sub-questions before any retrieval takes place. Each search step is then anchored to one of those sub-questions, which keeps the model from being sidetracked by partially relevant documents. The study evaluates models from 3B to 14B parameters across three model families and finds that identical reward signals can produce qualitatively different RL failure modes depending on the model. Effective training therefore depends on model-specific factors as well as reward design. The findings challenge the common recipe of combining reinforcement learning with an SFT cold start distilled from a stronger model, and they point to the dependency structure among sub-skills and to alternative ways of acquiring capabilities.
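The decompose-then-retrieve loop described above can be illustrated with a minimal Python sketch. It is not the paper's implementation. The names `Step`, `toy_search`, and `run_plan` are hypothetical, the retriever is a toy word-overlap ranker, and the sub-questions are written by hand where a trained model would generate them. The sketch shows only the control flow: the plan is fixed first, and every search is issued for exactly one planned sub-question, in order.

```python
from dataclasses import dataclass, field


@dataclass
class Step:
    """One planned sub-question and the evidence retrieved for it."""
    sub_question: str
    evidence: list[str] = field(default_factory=list)


def toy_search(query: str, corpus: list[str], k: int = 1) -> list[str]:
    """Rank documents by word overlap with the query.

    Hypothetical stand-in for a real retriever; assumed for illustration only.
    """
    query_words = set(query.lower().split())
    ranked = sorted(
        corpus,
        key=lambda doc: len(query_words & set(doc.lower().split())),
        reverse=True,
    )
    return ranked[:k]


def run_plan(sub_questions: list[str], corpus: list[str]) -> list[Step]:
    """Issue one search per sub-question, in the planned order."""
    return [Step(sq, toy_search(sq, corpus)) for sq in sub_questions]


corpus = [
    "Marie Curie was born in Warsaw",
    "Warsaw is the capital of Poland",
    "Paris is the capital of France",
]

# Hand-written plan for "Which country's capital was Marie Curie born in?"
# (assumption: in practice the model would produce this decomposition).
plan = [
    "Where was Marie Curie born",
    "Warsaw is the capital of which country",
]

steps = run_plan(plan, corpus)
for step in steps:
    print(step.sub_question, "->", step.evidence)
```

Because each query is built from a single sub-question and not from the full original question, the partially relevant document about Paris is outranked at the second step, even though it shares most of its words with the query.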
Key facts
- arXiv paper 2605.28354 introduces Plan for multi-hop retrieval.
- Plan decomposes questions into ordered sub-questions before retrieval.
- Each search step is anchored to one of the pre-planned sub-questions.
- Models from 3B to 14B parameters across three families were tested.
- Identical reward signals caused different RL failure modes per model.
- Training success depends on reward design and model-specific factors.
- The paper challenges the paradigm of combining RL with an SFT cold start distilled from a stronger model.
- It highlights the dependency structure among sub-skills and alternative ways of acquiring capabilities.
Entities
Institutions
- arXiv