ScheduleFree+ Outperforms WSD Schedules in LLM Training
A new machine learning method, ScheduleFree+, extends Schedule-Free Learning to large language models (LLMs) by addressing scaling issues with larger batch sizes and model sizes. The method eliminates the need for learning rate schedules and outperforms Warmup-Stable-Decay (WSD) schedules. At 1000 tokens per parameter, it achieves a 31% improvement over state-of-the-art schedules. The approach provides a theoretical foundation for model averaging and checkpoint merging during pretraining.
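For context, a Warmup-Stable-Decay schedule can be sketched as follows. This is a minimal illustration, not the baseline configuration used in the paper: the function name `wsd_lr`, the linear warmup and linear decay shapes, and every default value are assumptions chosen for readability.

```python
def wsd_lr(step, total_steps, peak_lr=3e-4, warmup_frac=0.01,
           decay_frac=0.2, min_lr=0.0):
    """Learning rate at `step` under a simple Warmup-Stable-Decay schedule.

    All defaults are illustrative assumptions, not values from the paper.
    """
    warmup_steps = max(1, int(total_steps * warmup_frac))
    decay_steps = max(1, int(total_steps * decay_frac))
    decay_start = total_steps - decay_steps

    if step < warmup_steps:
        # Warmup: ramp linearly from peak_lr / warmup_steps up to peak_lr.
        return peak_lr * (step + 1) / warmup_steps
    if step < decay_start:
        # Stable: hold the peak learning rate.
        return peak_lr
    # Decay: interpolate linearly from peak_lr down to min_lr.
    progress = min(1.0, (step - decay_start) / decay_steps)
    return peak_lr + (min_lr - peak_lr) * progress


if __name__ == "__main__":
    for s in (0, 5, 500, 900, 1000):
        print(s, wsd_lr(s, total_steps=1000))
```

The decay phase can only be positioned once `total_steps` is fixed, so the training length has to be chosen in advance. That dependency is what a schedule-free method is meant to remove.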
Key facts
- ScheduleFree+ is a schedule-free method for training LLMs: it removes the need for a learning rate schedule.
- It outperforms Warmup-Stable-Decay (WSD) schedules.
- At 1000 tokens per parameter, it achieves a 31% improvement over state-of-the-art schedules.
- Schedule-Free Learning has shown success across dozens of standard benchmark problems.
- Strong performance for LLM training was previously only demonstrated at small scales.
- The method provides a theoretical foundation for model averaging and checkpoint merging.
- The paper identifies fixes necessary to scale up Schedule-Free Learning to larger batch sizes and model sizes.
- Schedule-Free Learning is most effective for long-duration training.
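To make the underlying idea concrete, here is a minimal sketch of the base Schedule-Free SGD update that ScheduleFree+ builds on, applied to a one-dimensional quadratic. It does not include any of the scaling fixes described in the paper, which this summary does not detail. The function name `schedule_free_sgd`, the toy objective, and the values of `lr` and `beta` are assumptions for illustration.

```python
def schedule_free_sgd(grad_fn, w0, lr=0.5, beta=0.9, steps=2000):
    """Base Schedule-Free SGD on a scalar parameter (illustrative sketch).

    z: the plain SGD iterate
    x: a uniform running average of the z iterates (the point to evaluate)
    y: an interpolation of z and x, where the gradient is computed
    """
    z = w0
    x = w0
    for t in range(1, steps + 1):
        y = (1.0 - beta) * z + beta * x   # gradient evaluation point
        z = z - lr * grad_fn(y)           # constant learning rate, no schedule
        c = 1.0 / (t + 1)                 # uniform averaging weight
        x = (1.0 - c) * x + c * z         # update the running average
    return x


if __name__ == "__main__":
    # Toy objective f(w) = 0.5 * (w - 3)^2, whose gradient is w - 3.
    result = schedule_free_sgd(lambda w: w - 3.0, w0=0.0)
    print(result)
```

The learning rate stays constant and the step count never enters the update rule, so training can be stopped or extended at any point. The returned `x` is an average of past iterates, which is one way to read the connection to model averaging and checkpoint merging mentioned above.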
Entities
Institutions
- arXiv