veScale-FSDP: Flexible and High-Performance FSDP at Scale
veScale-FSDP is a newly developed Fully Sharded Data Parallel (FSDP) system that addresses the shortcomings of existing FSDP methods in large-scale model training. Existing FSDP frameworks rely on fixed element-wise or row-wise sharding formats, which conflict with block-structured computations. This rigidity blocks modern training techniques such as block-wise quantization and non-element-wise optimizers like Shampoo and Muon, and incurs significant communication and memory overheads at the scale of tens of thousands of GPUs. veScale-FSDP introduces RaggedShard, a flexible sharding format, paired with a structure-aware planning algorithm that enables zero-copy FSDP communication and natively supports block-wise quantization, improving both the performance and the flexibility of large-scale distributed training.
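To make the sharding contrast concrete, here is a minimal, illustrative Python sketch. It is not veScale-FSDP's actual API; the helper names, the block size, and the block-aligned partitioning scheme are assumptions for illustration only. Fixed element-wise sharding splits a flattened parameter into equal chunks, which can cut a quantization block across two ranks; a RaggedShard-style block-aligned sharding instead gives each rank a whole number of blocks, at the cost of unequal shard sizes.

```python
# Illustrative sketch only; not veScale-FSDP's API. `block_align_shard`
# is a hypothetical helper showing one way a block-aligned, ragged
# partitioning could differ from fixed element-wise chunking.
import torch

def elementwise_shard(flat: torch.Tensor, world_size: int) -> list[torch.Tensor]:
    # Fixed element-wise sharding: equal-sized chunks that may split a
    # quantization block across two ranks.
    return list(torch.chunk(flat, world_size))

def block_align_shard(flat: torch.Tensor, world_size: int,
                      block: int) -> list[torch.Tensor]:
    # Ragged sharding: each rank owns a whole number of `block`-element
    # blocks, so shard sizes differ but no block is ever split.
    # (Assumes flat.numel() is a multiple of `block`.)
    n_blocks = flat.numel() // block
    per_rank = [(n_blocks // world_size + (r < n_blocks % world_size)) * block
                for r in range(world_size)]
    return list(torch.split(flat, per_rank))

flat = torch.arange(10 * 64, dtype=torch.float32)  # 10 blocks of 64 elements
fixed = elementwise_shard(flat, 4)                 # 160 each: blocks get split
ragged = block_align_shard(flat, 4, block=64)      # 192/192/128/128: intact
print([s.numel() for s in fixed], [s.numel() for s in ragged])
```

The ragged shards are unequal in size, which is presumably where a structure-aware planning algorithm matters at scale: balancing whole blocks across ranks rather than forcing equal element counts.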
Key facts
- veScale-FSDP is a novel FSDP system.
- It addresses limitations of existing FSDP systems.
- Existing FSDP relies on fixed element-wise or row-wise sharding.
- Fixed sharding conflicts with block-structured computations.
- veScale-FSDP uses RaggedShard, a flexible sharding format.
- It includes a structure-aware planning algorithm.
- Enables zero-copy FSDP communications.
- Natively supports block-wise quantization (see the sketch after this list).
- Supports non-element-wise optimizers like Shampoo and Muon.
- Targets training at the scale of tens of thousands of GPUs.
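As referenced in the block-wise quantization item above, the following sketch shows why block-aligned shards help: when every shard holds whole blocks, each rank can compute per-block absmax scales and quantize its own shard locally, with no cross-rank gathering. The function names and the 64-element block size are hypothetical, not taken from veScale-FSDP.

```python
# Hedged sketch of block-wise (absmax) quantization on a block-aligned shard.
# Names and block size are illustrative assumptions, not veScale-FSDP's API.
import torch

def blockwise_quantize(shard: torch.Tensor, block: int = 64):
    # Valid only because the shard holds whole blocks (numel % block == 0).
    blocks = shard.view(-1, block)
    # One absmax scale per block, mapping values into the int8 range.
    scales = blocks.abs().amax(dim=1, keepdim=True).clamp(min=1e-12) / 127.0
    q = torch.round(blocks / scales).to(torch.int8)
    return q, scales

def blockwise_dequantize(q: torch.Tensor, scales: torch.Tensor) -> torch.Tensor:
    return (q.to(torch.float32) * scales).flatten()

shard = torch.randn(3 * 64)            # a ragged shard holding 3 whole blocks
q, scales = blockwise_quantize(shard)
err = (blockwise_dequantize(q, scales) - shard).abs().max()
print(q.shape, scales.shape, float(err))
```

Under fixed element-wise sharding, a block straddling two ranks would need communication before its scale could even be computed, which is the conflict with block-structured computation that the summary describes.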
Entities
- veScale-FSDP
- RaggedShard
- Shampoo
- Muon