veScale-FSDP: Flexible and High-Performance FSDP at Scale
veScale-FSDP is a newly developed Fully Sharded Data Parallel (FSDP) system that addresses the shortcomings of existing FSDP methods in large-scale model training. Existing FSDP frameworks rely on fixed element-wise or row-wise sharding formats, which conflict with block-structured computations. This rigidity blocks modern training techniques such as block-wise quantization and non-element-wise optimizers like Shampoo and Muon, and incurs significant communication and memory overheads at the scale of tens of thousands of GPUs. veScale-FSDP introduces RaggedShard, a flexible sharding format, paired with a structure-aware planning algorithm that enables zero-copy FSDP communication and natively supports block-wise quantization, improving both the performance and the flexibility of large-scale distributed training.
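To make the sharding contrast concrete, here is a minimal, illustrative Python sketch. It is not veScale-FSDP's actual API; the helper names, the block size, and the block-aligned partitioning scheme are assumptions for illustration only. Fixed element-wise sharding splits a flattened parameter into equal chunks, which can cut a quantization block across two ranks; a RaggedShard-style block-aligned sharding instead gives each rank a whole number of blocks, at the cost of unequal shard sizes.

```python
# Illustrative sketch only; not veScale-FSDP's API. `block_align_shard`
# is a hypothetical helper showing one way a block-aligned, ragged
# partitioning could differ from fixed element-wise chunking.
import torch

def elementwise_shard(flat: torch.Tensor, world_size: int) -> list[torch.Tensor]:
    # Fixed element-wise sharding: equal-sized chunks that may split a
    # quantization block across two ranks.
    return list(torch.chunk(flat, world_size))

def block_align_shard(flat: torch.Tensor, world_size: int,
                      block: int) -> list[torch.Tensor]:
    # Ragged sharding: each rank owns a whole number of `block`-element
    # blocks, so shard sizes differ but no block is ever split.
    # (Assumes flat.numel() is a multiple of `block`.)
    n_blocks = flat.numel() // block
    per_rank = [(n_blocks // world_size + (r < n_blocks % world_size)) * block
                for r in range(world_size)]
    return list(torch.split(flat, per_rank))

flat = torch.arange(10 * 64, dtype=torch.float32)  # 10 blocks of 64 elements
fixed = elementwise_shard(flat, 4)                 # 160 each: blocks get split
ragged = block_align_shard(flat, 4, block=64)      # 192/192/128/128: intact
print([s.numel() for s in fixed], [s.numel() for s in ragged])
```

The ragged shards are unequal in size, which is presumably where a structure-aware planning algorithm matters at scale: balancing whole blocks across ranks rather than forcing equal element counts.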
Key facts
- veScale-FSDP is a novel FSDP system.
- It addresses limitations of existing FSDP systems.
- Existing FSDP relies on fixed element-wise or row-wise sharding.
- Fixed sharding conflicts with block-structured computations.
- veScale-FSDP uses RaggedShard, a flexible sharding format.
- It includes a structure-aware planning algorithm.
- Enables zero-copy FSDP communications.
- Natively supports block-wise quantization (see the sketch after this list).
- Supports non-element-wise optimizers like Shampoo and Muon.
- Targets training at the scale of tens of thousands of GPUs.
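As referenced in the block-wise quantization item above, the following sketch shows why block-aligned shards help: when every shard holds whole blocks, each rank can compute per-block absmax scales and quantize its own shard locally, with no cross-rank gathering. The function names and the 64-element block size are hypothetical, not taken from veScale-FSDP.

```python
# Hedged sketch of block-wise (absmax) quantization on a block-aligned shard.
# Names and block size are illustrative assumptions, not veScale-FSDP's API.
import torch

def blockwise_quantize(shard: torch.Tensor, block: int = 64):
    # Valid only because the shard holds whole blocks (numel % block == 0).
    blocks = shard.view(-1, block)
    # One absmax scale per block, mapping values into the int8 range.
    scales = blocks.abs().amax(dim=1, keepdim=True).clamp(min=1e-12) / 127.0
    q = torch.round(blocks / scales).to(torch.int8)
    return q, scales

def blockwise_dequantize(q: torch.Tensor, scales: torch.Tensor) -> torch.Tensor:
    return (q.to(torch.float32) * scales).flatten()

shard = torch.randn(3 * 64)            # a ragged shard holding 3 whole blocks
q, scales = blockwise_quantize(shard)
err = (blockwise_dequantize(q, scales) - shard).abs().max()
print(q.shape, scales.shape, float(err))
```

Under fixed element-wise sharding, a block straddling two ranks would need communication before its scale could even be computed, which is the conflict with block-structured computation that the summary describes.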
Entities
- veScale-FSDP
- RaggedShard
- Shampoo
- Muon