ReCoVer: Fault-Tolerant System for LLM Pre-Training
A new system called ReCoVer addresses hardware faults during large language model pre-training on GPU clusters. It maintains a constant microbatch count per iteration so that per-iteration gradients remain stochastically equivalent to those of a failure-free run. The framework has three decoupled protocol layers: fault-tolerant collectives, in-step fine-grained recovery, and a versatile-workload policy for dynamically redistributing microbatch quotas. It is parallelism-agnostic and integrates directly with existing training frameworks.
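To make the constant-count invariant concrete, here is a minimal sketch assuming a data-parallel setup with per-replica microbatch quotas; the function and variable names are hypothetical illustrations, not ReCoVer's actual API. When a replica fails, its quota is spread over the survivors so the per-iteration total stays fixed:

```python
def redistribute_quotas(quotas: dict[int, int], failed: set[int]) -> dict[int, int]:
    """Reassign microbatch quotas from failed replicas to survivors,
    keeping the per-iteration total constant (hypothetical sketch)."""
    survivors = [r for r in quotas if r not in failed]
    if not survivors:
        raise RuntimeError("no surviving replicas")
    orphaned = sum(quotas[r] for r in failed)
    new_quotas = {r: quotas[r] for r in survivors}
    # Spread the orphaned microbatches round-robin over the survivors.
    for i in range(orphaned):
        new_quotas[survivors[i % len(survivors)]] += 1
    return new_quotas

# Example: 4 replicas with 8 microbatches each; replica 2 fails.
quotas = {0: 8, 1: 8, 2: 8, 3: 8}
print(redistribute_quotas(quotas, failed={2}))
# -> {0: 11, 1: 11, 3: 10}; the total stays at 32
```

Because the total is preserved, the averaged gradient covers the same number of samples as in a failure-free iteration, which is one way such a system can keep training statistically on track.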
Key facts
- ReCoVer is a resilient LLM pre-training system.
- It keeps the number of microbatches constant per iteration.
- It ensures per-iteration gradients are stochastically equivalent to those of a failure-free run.
- The framework has three decoupled protocol layers.
- Fault-tolerant collectives isolate faults across replicas.
- In-step fine-grained recovery preserves intra-iteration progress (sketched after this list).
- The versatile-workload policy dynamically redistributes microbatch quotas, as in the redistribution sketch above.
- The design is parallelism-agnostic.
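The sketch below illustrates the in-step recovery idea referenced above. It is a toy model under assumed names (`HardwareFault`, `compute_grad`, and `run_step` are all hypothetical), not ReCoVer's implementation: gradients from microbatches finished before a fault are retained, so a retry only redoes the unfinished work within the same iteration.

```python
import random

class HardwareFault(Exception):
    """Stand-in for a hardware failure surfacing mid-iteration."""

def compute_grad(mb, fail_prob=0.0):
    # Stand-in for a real microbatch gradient computation.
    if random.random() < fail_prob:
        raise HardwareFault(f"fault while processing microbatch {mb}")
    return mb * 0.1

def run_step(microbatches, done=None, fail_prob=0.0):
    """Process microbatches, skipping any whose gradients already exist."""
    done = {} if done is None else done
    for i, mb in enumerate(microbatches):
        if i in done:
            continue  # intra-iteration progress preserved across the fault
        done[i] = compute_grad(mb, fail_prob)
    return done

microbatches = list(range(8))
partial = {}
try:
    run_step(microbatches, partial, fail_prob=0.3)
except HardwareFault:
    pass  # gradients accumulated in `partial` survive the fault
grads = run_step(microbatches, partial)  # retry redoes only unfinished work
assert len(grads) == len(microbatches)   # microbatch count per iteration unchanged
```

Compared with restarting the whole step, resuming from the surviving partial state is what makes the recovery fine-grained: the iteration still covers every microbatch exactly once.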