ARTFEED — Contemporary Art Intelligence

veScale-FSDP: Flexible and High-Performance FSDP at Scale

ai-technology · 2026-04-24

veScale-FSDP is a new Fully Sharded Data Parallel (FSDP) system that addresses the shortcomings of existing FSDP frameworks used in large-scale model training. Those frameworks rely on rigid element-wise or row-wise sharding formats that are incompatible with block-structured computation. This blocks modern training techniques such as block-wise quantization and non-element-wise optimizers like Shampoo and Muon, and incurs significant communication and memory overhead when training on tens of thousands of GPUs. veScale-FSDP introduces RaggedShard, a flexible sharding format, paired with a structure-aware planning algorithm that enables zero-copy FSDP communication and natively supports block-wise quantization, improving both performance and flexibility in large-scale distributed training.
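As a rough illustration of the conflict described above (this is a sketch, not veScale-FSDP's actual implementation; the 128x128 block size and 8-rank world are assumed for the example): classic FSDP flattens a parameter and splits it evenly across ranks, so the elements of a single quantization block end up scattered across several GPUs, and any per-block scale computation would require cross-rank communication.

```python
# Sketch only: why fixed flatten-and-split (row-wise) sharding tears
# apart block-wise quantization tiles. Block size and world size are
# illustrative assumptions, not values from veScale-FSDP.

BLOCK = 128   # assumed quantization tile is BLOCK x BLOCK
WORLD = 8     # assumed number of ranks

def rowwise_owner(flat_idx: int, numel: int, world: int) -> int:
    """Rank that owns a flattened element under even element-wise sharding."""
    per_rank = numel // world
    return flat_idx // per_rank

rows = cols = 256           # a small 256 x 256 parameter matrix
numel = rows * cols

# Which ranks hold pieces of the *first* 128x128 quantization tile?
owners = {
    rowwise_owner(r * cols + c, numel, WORLD)
    for r in range(BLOCK)
    for c in range(BLOCK)
}
print(sorted(owners))  # -> [0, 1, 2, 3]: one tile is split over four ranks
```

Because one tile here spans four ranks, its quantization scale cannot be computed locally; a block-aware sharding format avoids exactly this situation.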

Key facts

  • veScale-FSDP is a novel FSDP system.
  • It addresses limitations of existing FSDP systems.
  • Existing FSDP relies on fixed element-wise or row-wise sharding.
  • Fixed sharding conflicts with block-structured computations.
  • veScale-FSDP uses RaggedShard, a flexible sharding format.
  • It includes a structure-aware planning algorithm.
  • Enables zero-copy FSDP communications.
  • Natively supports block-wise quantization.
  • Supports non-element-wise optimizers like Shampoo and Muon.
  • Targets training at tens of thousands of GPUs.
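To give a sense of the idea behind a flexible sharding format like RaggedShard (the function below is a hypothetical sketch under assumed parameters, not veScale-FSDP's real API): per-rank shard sizes are allowed to differ so that every shard boundary lands on a quantization-block boundary, meaning no block is ever torn across ranks and gathered shards can be reassembled without repacking.

```python
# Hypothetical sketch of block-aligned "ragged" sharding. Names,
# block size, and element counts are assumptions for illustration.

def ragged_split(numel: int, world: int, block: int) -> list[int]:
    """Split `numel` elements across `world` ranks, rounding each cut
    up to a multiple of `block` so no quantization block is split."""
    sizes, start = [], 0
    for rank in range(world):
        ideal_end = numel * (rank + 1) // world
        end = min(numel, ((ideal_end + block - 1) // block) * block)
        if rank == world - 1:
            end = numel          # last rank absorbs the remainder
        sizes.append(end - start)
        start = end
    return sizes

# 10 tiles of 128x128 elements split across 4 ranks:
sizes = ragged_split(numel=10 * 128 * 128, world=4, block=128 * 128)
print(sizes)  # -> [49152, 32768, 49152, 32768]: unequal, but block-aligned
```

The shards are deliberately unequal ("ragged"), but each ends on a block boundary, which is what lets per-block quantization scales stay rank-local and shard buffers be communicated as-is.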
