BEACON: Milestone-Guided RL for Long-Horizon Language Agents

other · 2026-05-09

BEACON is a milestone-guided policy learning framework that targets two problems in reinforcement learning for long-horizon language-agent tasks: credit misattribution and sample inefficiency. It partitions each trajectory at milestone boundaries and applies temporal reward shaping within the resulting segments, which enables more precise credit assignment. On the ALFWorld, WebShop, and ScienceWorld benchmarks, BEACON consistently outperforms the existing methods GRPO and GiGPO, particularly on long-horizon tasks.
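To make the partition-and-shape idea concrete, here is a minimal Python sketch. It is not the paper's implementation: the summary does not say how milestones are detected or what form the shaping takes. The sketch assumes that milestone step indices are already known, and that shaping discounts a milestone reward backward through its segment by a factor gamma per step. The names partition_at_milestones, shape_rewards, milestone_reward, and gamma are hypothetical.

```python
from typing import List, Tuple


def partition_at_milestones(num_steps: int, milestone_steps: List[int]) -> List[Tuple[int, int]]:
    """Split steps 0..num_steps-1 into (start, end) segments, end inclusive.

    Each segment ends at a milestone step; steps after the last milestone
    form one trailing segment. (Assumed scheme, not taken from the paper.)
    """
    segments = []
    start = 0
    for m in sorted(set(milestone_steps)):
        if not 0 <= m < num_steps:
            raise ValueError(f"milestone step {m} is outside the trajectory")
        segments.append((start, m))
        start = m + 1
    if start < num_steps:
        segments.append((start, num_steps - 1))
    return segments


def shape_rewards(
    num_steps: int,
    milestone_steps: List[int],
    milestone_reward: float = 1.0,  # hypothetical reward per milestone
    gamma: float = 0.9,             # hypothetical per-step discount
) -> List[float]:
    """Give each step a reward that decays with its distance to the
    milestone closing its segment. Steps in a trailing segment that
    reaches no milestone keep a reward of zero."""
    milestones = set(milestone_steps)
    shaped = [0.0] * num_steps
    for start, end in partition_at_milestones(num_steps, milestone_steps):
        if end not in milestones:
            continue  # trailing segment: no milestone was reached
        for t in range(start, end + 1):
            shaped[t] = milestone_reward * gamma ** (end - t)
    return shaped


segments = partition_at_milestones(6, [1, 4])
rewards = shape_rewards(6, [1, 4], milestone_reward=1.0, gamma=0.5)
```

Under these assumptions, a six-step trajectory with milestones at steps 1 and 4 splits into three segments, and steps closer to a milestone receive a larger share of its reward, so credit does not leak across segment boundaries.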

Key facts

BEACON is a milestone-guided policy learning framework for language agents.
It addresses credit misattribution and sample inefficiency in RL.
It partitions trajectories at milestone boundaries.
It applies temporal reward shaping within segments.
It estimates advantages at dual scales.
It outperforms GRPO and GiGPO on ALFWorld, WebShop, and ScienceWorld.
It is particularly effective on long-horizon ALFWorld tasks.
It was introduced in arXiv paper 2605.06078.
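
One way to read "advantages at dual scales" is sketched below in Python. This is an illustration under stated assumptions, not the method from the paper: the summary does not define the two scales or how they are combined. The sketch assumes a trajectory-level advantage computed by group-relative normalization (subtract the group mean, divide by the group standard deviation, as in GRPO) and a segment-level advantage normalized the same way across rollouts for each segment index, added with a hypothetical weight. The names group_normalize, dual_scale_advantages, and weight are made up for this example.

```python
from statistics import mean, pstdev
from typing import List


def group_normalize(values: List[float], eps: float = 1e-8) -> List[float]:
    """Group-relative normalization: (value - group mean) / (group std + eps)."""
    mu = mean(values)
    sigma = pstdev(values)
    return [(v - mu) / (sigma + eps) for v in values]


def dual_scale_advantages(
    traj_returns: List[float],
    segment_returns: List[List[float]],
    weight: float = 0.5,  # hypothetical mixing weight for the segment scale
) -> List[List[float]]:
    """Return adv[i][k] for rollout i and segment k.

    traj_returns[i] is the total return of rollout i.
    segment_returns[i][k] is the return rollout i earned in segment k;
    all rollouts are assumed to be aligned to the same number of segments.
    """
    num_segments = len(segment_returns[0])
    if any(len(row) != num_segments for row in segment_returns):
        raise ValueError("all rollouts must have the same number of segments")
    if len(traj_returns) != len(segment_returns):
        raise ValueError("one trajectory return is required per rollout")

    # Coarse scale: compare whole rollouts within the group.
    traj_adv = group_normalize(traj_returns)
    # Fine scale: compare rollouts segment by segment.
    seg_adv = [
        group_normalize([row[k] for row in segment_returns])
        for k in range(num_segments)
    ]
    return [
        [traj_adv[i] + weight * seg_adv[k][i] for k in range(num_segments)]
        for i in range(len(traj_returns))
    ]


adv = dual_scale_advantages(
    traj_returns=[1.0, 0.0],
    segment_returns=[[1.0, 1.0], [1.0, 0.0]],
)
```

In this toy group both rollouts succeed in the first segment, so the segment-scale term there is zero and only the second segment, where the rollouts differ, adds extra signal on top of the trajectory-level advantage.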
