BEACON: Milestone-Guided RL for Long-Horizon Language Agents
Researchers have developed BEACON, a milestone-guided policy learning framework that addresses credit misattribution and sample inefficiency in reinforcement learning for long-horizon language agent tasks. BEACON partitions each trajectory at milestone boundaries, applies temporal reward shaping within the resulting segments, and estimates advantages at two scales, which allows more precise credit assignment. On the ALFWorld, WebShop, and ScienceWorld benchmarks, BEACON consistently outperforms the existing methods GRPO and GiGPO, particularly on long-horizon tasks.
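To make the partitioning and shaping idea concrete, here is a minimal sketch in Python. It is not the paper's implementation: the summary does not give BEACON's shaping formula, so the exponential decay toward the next milestone, the function names, and the parameters `milestone_reward` and `gamma` are all illustrative assumptions.

```python
from typing import List


def partition_at_milestones(num_steps: int, milestone_steps: List[int]) -> List[range]:
    """Split step indices 0..num_steps-1 into segments that each end at a milestone.

    Steps after the last milestone form a trailing segment with no milestone.
    """
    segments = []
    start = 0
    for m in sorted(set(milestone_steps)):
        if not 0 <= m < num_steps:
            raise ValueError(f"milestone step {m} is outside the trajectory")
        segments.append(range(start, m + 1))
        start = m + 1
    if start < num_steps:
        segments.append(range(start, num_steps))
    return segments


def shape_rewards(
    num_steps: int,
    milestone_steps: List[int],
    milestone_reward: float = 1.0,  # assumed value, not from the source
    gamma: float = 0.9,             # assumed decay factor, not from the source
) -> List[float]:
    """Hypothetical temporal shaping: within a segment, a step's reward decays
    with its distance to the milestone that closes the segment."""
    shaped = [0.0] * num_steps
    milestones = set(milestone_steps)
    for segment in partition_at_milestones(num_steps, milestone_steps):
        end = segment[-1]
        if end not in milestones:
            continue  # trailing segment reached no milestone, so it earns nothing
        for t in segment:
            shaped[t] = milestone_reward * gamma ** (end - t)
    return shaped
```

For a six-step trajectory with milestones reached at steps 1 and 4, `partition_at_milestones(6, [1, 4])` yields three segments (steps 0 to 1, 2 to 4, and 5). Under this assumed scheme, credit from a milestone never leaks into an earlier or later segment, which is the property the milestone boundaries are meant to provide.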
Key facts
- BEACON is a milestone-guided policy learning framework for language agents.
- It addresses credit misattribution and sample inefficiency in RL.
- It partitions trajectories at milestone boundaries.
- It applies temporal reward shaping within each segment.
- It estimates advantages at two scales.
- It outperforms GRPO and GiGPO on ALFWorld, WebShop, and ScienceWorld.
- It is particularly effective on long-horizon ALFWorld tasks.
- It was introduced in arXiv paper 2605.06078.
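The dual-scale advantage estimate listed above can be illustrated with a short Python sketch. The summary does not say which two scales BEACON uses or how it combines them, so everything here is an assumption: a coarse, GRPO-style group-normalized advantage over whole-episode returns, a fine advantage normalized per segment across the same group of rollouts, and a hypothetical `weight` that sums the two.

```python
from statistics import mean, pstdev
from typing import List


def normalize(values: List[float], eps: float = 1e-8) -> List[float]:
    """Center and scale values by their group mean and population std."""
    mu = mean(values)
    sigma = pstdev(values)
    return [(v - mu) / (sigma + eps) for v in values]


def dual_scale_advantages(
    episode_returns: List[float],
    segment_returns: List[List[float]],
    weight: float = 1.0,  # assumed mixing weight, not from the source
) -> List[List[float]]:
    """Return combined[i][k]: the advantage for segment k of rollout i.

    episode_returns[i] is the total return of rollout i.
    segment_returns[i][k] is the return rollout i earned inside segment k.
    All rollouts are assumed to share the same number of segments.
    """
    n = len(episode_returns)
    if n == 0 or len(segment_returns) != n:
        raise ValueError("need one list of segment returns per rollout")
    num_segments = len(segment_returns[0])
    if any(len(s) != num_segments for s in segment_returns):
        raise ValueError("all rollouts must have the same number of segments")

    coarse = normalize(episode_returns)  # trajectory-scale signal
    combined = [[0.0] * num_segments for _ in range(n)]
    for k in range(num_segments):
        # Segment-scale signal: compare rollouts on the same segment only.
        fine = normalize([segment_returns[i][k] for i in range(n)])
        for i in range(n):
            combined[i][k] = coarse[i] + weight * fine[i]
    return combined
```

In this sketch, two rollouts that both complete the first segment receive the same fine-scale advantage there (zero), so only the segment where they diverge separates them further. That is one plausible reading of how a second, finer scale could reduce credit misattribution.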
Entities
Institutions
- arXiv