ScaleAcross Explorer Optimizes Cross-Data Center AI Training
Meta has released a research paper describing ScaleAcross Explorer, an optimizer that targets communication efficiency when training large language models across multiple data centers. The paper, available on arXiv, addresses the growing complexity of spreading GPU resources across multiple locations, an approach it calls "scale-across" training. It characterizes three design dimensions: parallelism placement, parallelism scheduling, and network layer technologies. By optimizing these dimensions jointly rather than one at a time, the optimizer aims to speed up exploration of the design space and enable efficient training of frontier models. The work draws on Meta's production experience operating hundreds of thousands of GPUs across many data centers.
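To make the idea of joint optimization concrete, here is a minimal toy sketch of a design-space search over the three dimensions. It is not the paper's algorithm or cost model: the `Config` fields, the traffic volumes, the pipeline bubble penalty, the overlap fraction, and the `explore` function are all hypothetical values and names chosen only for illustration.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class Config:
    placement: str     # which parallelism dimension spans data centers
    overlap: bool      # scheduling: overlap cross-site traffic with compute
    link_gbps: float   # network: cross-site bandwidth available to the job


# All constants below are made-up illustration values, not measurements.
COMPUTE_S = 10.0                                  # compute time per step
CROSS_SITE_GB = {"data_parallel": 40.0,           # traffic per step over the
                 "pipeline_parallel": 8.0}        # inter-data-center link
BUBBLE_S = {"data_parallel": 0.0,                 # extra idle time caused by
            "pipeline_parallel": 0.5}             # the placement itself
HIDDEN_FRACTION = 0.75                            # share of comm hidden by overlap


def step_time(cfg: Config) -> float:
    """Toy cost model: compute + placement penalty + exposed communication."""
    comm_s = CROSS_SITE_GB[cfg.placement] * 8 / cfg.link_gbps  # GB -> Gb
    exposed = comm_s * (1 - HIDDEN_FRACTION) if cfg.overlap else comm_s
    return COMPUTE_S + BUBBLE_S[cfg.placement] + exposed


def explore(placements, overlaps, links) -> Config:
    """Exhaustively score every combination and return the fastest one."""
    configs = [Config(p, o, l) for p, o, l in product(placements, overlaps, links)]
    return min(configs, key=step_time)


if __name__ == "__main__":
    placements = ["data_parallel", "pipeline_parallel"]
    for gbps in (100.0, 400.0):
        best = explore(placements, [False, True], [gbps])
        print(gbps, best, round(step_time(best), 2))
```

Under these invented numbers, the preferred placement changes with the available link bandwidth, which is the kind of interaction that motivates searching the dimensions together instead of tuning each in isolation. A real optimizer would need a far more detailed cost model and a smarter search than exhaustive enumeration.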
Key facts
- ScaleAcross Explorer is an optimizer for cross-data center AI training.
- The paper is from Meta and published on arXiv.
- It addresses the "scale-across" training paradigm.
- Three design dimensions are characterized: parallelism placement, parallelism scheduling, and network layer technologies.
- The optimizer tunes these dimensions jointly rather than in isolation.
- Meta's production experience with hundreds of thousands of GPUs is leveraged.
- Training jobs are deployed across multiple data center buildings and regions.
- The goal is efficient training for frontier model development.
Entities
Institutions
- Meta
- arXiv