ReuseRL: Skill Compression Improves Agentic RL Generalization
Researchers have developed ReuseRL, a framework that grounds agentic reinforcement learning in the Minimum Description Length (MDL) principle. ReuseRL extracts a shared skill dictionary from successful trajectories and augments the RL objective with a segmentation cost that penalizes idiosyncratic behaviors. The authors prove a PAC-Bayes generalization bound for this compression penalty. Experiments on ALFWorld, TextWorld-Cooking, and Countdown-Stepwise show that ReuseRL improves both in-distribution and out-of-distribution success over vanilla GRPO and round-length baselines.
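To make the skill-dictionary and segmentation-cost idea concrete, here is a minimal sketch, not the authors' implementation. It assumes the dictionary is simply the set of action n-grams that recur across successful trajectories, and that the segmentation cost is a two-part code: each segment pays one flag bit plus a uniform index, log2|D| bits for a dictionary skill or log2|A| bits for a single primitive action. The names `extract_skill_dictionary` and `segmentation_cost` and all their parameters are hypothetical. Dynamic programming finds the cheapest segmentation, so trajectories built from shared skills compress well, while idiosyncratic ones fall back to expensive action-by-action encoding.

```python
import math
from collections import Counter
from typing import Sequence


def extract_skill_dictionary(
    successful: Sequence[Sequence[str]],
    min_len: int = 2,
    max_len: int = 4,
    min_count: int = 2,
) -> list[tuple[str, ...]]:
    """Hypothetical extraction: keep action n-grams that recur across successful trajectories."""
    counts: Counter = Counter()
    for traj in successful:
        for n in range(min_len, max_len + 1):
            for i in range(len(traj) - n + 1):
                counts[tuple(traj[i:i + n])] += 1
    frequent = [gram for gram, c in counts.items() if c >= min_count]
    # Most frequent first, ties broken lexicographically for a deterministic order.
    return sorted(frequent, key=lambda g: (-counts[g], g))


def segmentation_cost(
    traj: Sequence[str],
    dictionary: Sequence[tuple[str, ...]],
    num_primitives: int,
) -> float:
    """Minimum bits to encode traj as a sequence of dictionary skills and primitive actions.

    Assumed code: 1 flag bit per segment, plus log2(|D|) bits to index a skill
    or log2(num_primitives) bits to index a single primitive action.
    """
    skill_bits = 1 + (math.log2(len(dictionary)) if dictionary else 0.0)
    prim_bits = 1 + math.log2(num_primitives)
    skills = set(dictionary)
    lengths = sorted({len(s) for s in skills})
    best = [0.0] + [math.inf] * len(traj)  # best[i] = cheapest encoding of traj[:i]
    for i in range(1, len(traj) + 1):
        best[i] = best[i - 1] + prim_bits  # encode traj[i-1] as a lone primitive
        for n in lengths:
            if n <= i and tuple(traj[i - n:i]) in skills:
                best[i] = min(best[i], best[i - n] + skill_bits)  # reuse a skill
    return best[-1]


successful = [["go", "open", "take", "go"], ["open", "take", "go", "put"]]
dictionary = extract_skill_dictionary(successful)
print(segmentation_cost(["open", "take", "go"], dictionary, num_primitives=4))  # reuses a skill
print(segmentation_cost(["put", "go", "put"], dictionary, num_primitives=4))    # 9.0
```

Here the trajectory that reuses a shared skill costs about 2.6 bits, while an equally long idiosyncratic one costs 9 bits; a penalty of this kind pushes the policy toward the former.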
Key facts
- ReuseRL grounds agentic RL in the Minimum Description Length (MDL) principle.
- It extracts a shared skill dictionary from successful trajectories.
- The RL objective is augmented with a segmentation cost penalizing idiosyncratic behaviors.
- A PAC-Bayes generalization bound is proven for the compression penalty.
- Evaluated on ALFWorld, TextWorld-Cooking, and Countdown-Stepwise.
- Improves in- and out-of-distribution success over vanilla GRPO and round-length baselines.
- Large language model agents trained with RL often learn brittle, task-specific shortcuts.
- The hypothesis is that agents generalize better when successful trajectories are structurally compressible.
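The augmented objective can be illustrated by plugging a per-trajectory segmentation cost into GRPO's group-relative advantage, where each rollout's reward is centered and scaled by the mean and standard deviation of its sampled group. This is a minimal sketch under stated assumptions: the penalty is assumed to enter as a reward term r − λ·cost, and `augmented_rewards`, `group_relative_advantages`, `lam`, and the example costs are hypothetical; the paper may combine the terms differently.

```python
import statistics
from typing import Sequence


def augmented_rewards(
    task_rewards: Sequence[float],
    seg_costs: Sequence[float],
    lam: float = 0.01,
) -> list[float]:
    """Hypothetical shaping: task reward minus lam times the segmentation cost (in bits)."""
    return [r - lam * c for r, c in zip(task_rewards, seg_costs)]


def group_relative_advantages(rewards: Sequence[float], eps: float = 1e-6) -> list[float]:
    """GRPO-style advantage: center and scale each reward within its rollout group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std; implementations differ here
    return [(r - mean) / (std + eps) for r in rewards]


# Four rollouts of the same task: two successes (reward 1) and two failures (reward 0).
task = [1.0, 1.0, 0.0, 0.0]
costs = [4.0, 20.0, 10.0, 10.0]  # the second success used a less compressible trajectory
print(group_relative_advantages(augmented_rewards(task, costs, lam=0.0)))   # vanilla GRPO
print(group_relative_advantages(augmented_rewards(task, costs, lam=0.01)))  # with penalty
```

Under vanilla GRPO the two successes receive identical credit; with the penalty, the success that reuses compressible structure receives the larger advantage. Applying the penalty to failed rollouts as well, and the value of λ, are assumptions of this sketch.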
Entities
Institutions
- arXiv