DARTS: Accelerating LLM Reinforcement Learning via Distribution-Aware Trajectory Shaping
DARTS (Distribution-Aware Active Rollout Trajectory Shaping) targets rollout inefficiency in reinforcement learning for large language models. The inefficiency stems from the long-tail distribution of response lengths, which earlier methods have tried to mitigate with prompt-level tail scheduling. DARTS instead examines the long tail at a finer granularity and identifies intra-prompt long tails, which often consist of ineffective verbosity. It introduces active distribution shaping to steer the rollout distribution toward conciseness and certainty, reducing the overhead caused by the tail. Two components implement this: a distribution-aware trajectory sampling mechanism that selects trajectories from a redundant exploration space for each prompt, and an adaptive redundancy allocation scheme that optimizes the shaping. The method is described in a paper available on arXiv (2605.30859).
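To make the tail overhead concrete, the following is a minimal sketch, not the paper's algorithm. It assumes a synchronous rollout in which a step finishes only when the longest response finishes, and it assumes one simple way of using redundant samples: draw a few extra trajectories per prompt and keep the shortest ones. The function names `rollout_step_cost` and `keep_shortest` and all token lengths are hypothetical. A real system would not know lengths in advance and would more plausibly keep the trajectories that finish first.

```python
def rollout_step_cost(lengths):
    """Cost of one synchronous rollout step, in tokens.

    Assumption: the step ends only when the longest response finishes,
    so the cost is the maximum length in the batch.
    """
    return max(lengths)


def keep_shortest(lengths, n_keep):
    """Keep the n_keep shortest trajectories out of an oversampled set.

    Hypothetical selection rule used for illustration only; the dropped
    trajectories stand in for the intra-prompt long tail.
    """
    if n_keep > len(lengths):
        raise ValueError("n_keep exceeds the number of sampled trajectories")
    return sorted(lengths)[:n_keep]


# Hypothetical token lengths for the responses sampled for ONE prompt.
baseline = [310, 290, 4200, 350]                 # 4 samples, one long-tail outlier
oversampled = [310, 290, 4200, 350, 330, 2800]   # 4 samples + 2 redundant samples

shaped = keep_shortest(oversampled, 4)

print(rollout_step_cost(baseline))  # 4200
print(shaped)                       # [290, 310, 330, 350]
print(rollout_step_cost(shaped))    # 350
```

In this toy case a single verbose response sets the cost of the whole step, and two redundant samples are enough to drop the tail. The trade-off is the extra generation work spent on trajectories that are discarded.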
Key facts
- DARTS stands for Distribution-Aware Active Rollout Trajectory Shaping
- It targets rollout efficiency bottlenecks in LLM reinforcement learning
- The inefficiency stems from the long-tail distribution of response lengths
- Existing works use prompt-level tail scheduling
- DARTS identifies intra-prompt long tails with ineffective verbosity
- It uses active distribution shaping for conciseness and certainty
- It includes a distribution-aware trajectory sampling mechanism
- It includes an adaptive redundancy allocation scheme
- Paper available on arXiv with ID 2605.30859
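The adaptive redundancy allocation scheme listed above is not specified in this summary, so the following is only an illustrative sketch of what such a scheme could look like. It assumes a fixed budget of extra samples that is split across prompts in proportion to a tail-heaviness score computed from earlier rollouts. The score definition, the function names `tail_score` and `allocate_redundancy`, and the example lengths are all hypothetical.

```python
from statistics import median


def tail_score(lengths):
    """Hypothetical tail-heaviness score for one prompt.

    Measures how far the longest past response exceeds the median,
    relative to the median. Returns 0.0 when there is no tail.
    """
    med = median(lengths)
    if med <= 0:
        return 0.0
    return max(0.0, (max(lengths) - med) / med)


def allocate_redundancy(scores, budget):
    """Split `budget` extra samples across prompts in proportion to `scores`.

    Uses largest-remainder rounding so the allocations sum exactly to budget.
    """
    total = sum(scores.values())
    if total == 0:
        return {prompt: 0 for prompt in scores}
    raw = {prompt: budget * s / total for prompt, s in scores.items()}
    alloc = {prompt: int(v) for prompt, v in raw.items()}
    leftover = budget - sum(alloc.values())
    # Hand the remaining samples to the prompts with the largest fractional parts.
    by_remainder = sorted(raw, key=lambda p: raw[p] - alloc[p], reverse=True)
    for prompt in by_remainder[:leftover]:
        alloc[prompt] += 1
    return alloc


# Hypothetical response lengths observed in earlier rollouts, per prompt.
history = {
    "p1": [300, 320, 310, 330],    # almost no tail
    "p2": [300, 350, 2800, 4200],  # heavy tail
    "p3": [200, 220, 900, 210],    # one outlier on a short median
}

scores = {prompt: tail_score(lengths) for prompt, lengths in history.items()}
extra = allocate_redundancy(scores, budget=8)
print(extra)
```

Under these assumptions, prompts whose past responses show a heavier tail receive more redundant samples, while prompts with tightly clustered lengths receive few or none.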
Entities
Institutions
- arXiv