Cloudless-Training: Efficient Geo-Distributed ML Framework
Cloudless-Training, introduced in arXiv:2303.05330, is a framework for efficient machine learning training across multiple geographic regions. It targets two problems: inefficient elastic scheduling of cloud resources that span regions, and the communication overhead of training over wide area networks (WANs), which have low bandwidth and fluctuate considerably. The framework uses a two-layer architecture, with a control plane and a physical training plane, to support serverless elastic scheduling and communication. It also provides an elastic scheduling strategy that adapts training workflows to heterogeneity in the available resources. The work is relevant to emerging machine learning applications such as large model training and federated learning.
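To make the idea of heterogeneity-aware scheduling concrete, the following is a minimal sketch of one simple policy: splitting a global batch across regions in proportion to each region's throughput, so that faster regions receive more work per step. This is not the algorithm from the paper, which the summary above does not detail. The `Region` type, the `samples_per_sec` field, and the `split_batch` function are all hypothetical names introduced only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Region:
    name: str               # hypothetical region label, e.g. "region-a"
    samples_per_sec: float  # assumed measured training throughput


def split_batch(regions: list[Region], global_batch: int) -> dict[str, int]:
    """Split global_batch across regions in proportion to throughput.

    Illustrative policy only; not taken from the Cloudless-Training paper.
    """
    total = sum(r.samples_per_sec for r in regions)
    if total <= 0:
        raise ValueError("at least one region needs positive throughput")

    # Ideal (fractional) share for each region.
    exact = {r.name: global_batch * r.samples_per_sec / total for r in regions}

    # Round down, then hand the leftover samples to the regions with the
    # largest fractional parts (largest-remainder method).
    shares = {name: int(value) for name, value in exact.items()}
    leftover = global_batch - sum(shares.values())
    by_fraction = sorted(exact, key=lambda n: exact[n] - shares[n], reverse=True)
    for name in by_fraction[:leftover]:
        shares[name] += 1
    return shares
```

For example, with two assumed regions running at 300 and 100 samples per second and a global batch of 64, `split_batch` assigns 48 samples to the faster region and 16 to the slower one. A real scheduler would also have to account for WAN bandwidth and where the training data already resides, which this sketch ignores.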
Key facts
- Cloudless-Training is a framework for geo-distributed ML training.
- It addresses elastic scheduling and WAN communication challenges.
- It uses a two-layer architecture with control and physical training planes.
- It supports serverless elastic scheduling and communication.
- Its elastic scheduling strategy adapts to heterogeneity.
- Targets large model training and federated learning.
- Published on arXiv with ID 2303.05330.
- Aims to improve resource utilization and training performance.
Entities
Institutions
- arXiv