ReCoVer: Fault-Tolerant System for LLM Pre-Training
A new system called ReCoVer addresses hardware faults during large language model pre-training on GPU clusters. It maintains a constant microbatch count per iteration so that per-iteration gradients remain stochastically equivalent to those of a failure-free run. The framework has three decoupled protocol layers: fault-tolerant collectives, in-step fine-grained recovery, and a versatile-workload policy for dynamically redistributing microbatch quotas. It is parallelism-agnostic and integrates directly with existing training frameworks.
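To make the constant-count invariant concrete, here is a minimal sketch assuming a data-parallel setup with per-replica microbatch quotas; the function and variable names are hypothetical illustrations, not ReCoVer's actual API. When a replica fails, its quota is spread over the survivors so the per-iteration total stays fixed:

```python
def redistribute_quotas(quotas: dict[int, int], failed: set[int]) -> dict[int, int]:
    """Reassign microbatch quotas from failed replicas to survivors,
    keeping the per-iteration total constant (hypothetical sketch)."""
    survivors = [r for r in quotas if r not in failed]
    if not survivors:
        raise RuntimeError("no surviving replicas")
    orphaned = sum(quotas[r] for r in failed)
    new_quotas = {r: quotas[r] for r in survivors}
    # Spread the orphaned microbatches round-robin over the survivors.
    for i in range(orphaned):
        new_quotas[survivors[i % len(survivors)]] += 1
    return new_quotas

# Example: 4 replicas with 8 microbatches each; replica 2 fails.
quotas = {0: 8, 1: 8, 2: 8, 3: 8}
print(redistribute_quotas(quotas, failed={2}))
# -> {0: 11, 1: 11, 3: 10}; the total stays at 32
```

Because the total is preserved, the averaged gradient covers the same number of samples as in a failure-free iteration, which is one way such a system can keep training statistically on track.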
Key facts
- ReCoVer is a resilient LLM pre-training system.
- It keeps the number of microbatches constant per iteration.
- It ensures per-iteration gradients are stochastically equivalent to those of a failure-free run.
- The framework has three decoupled protocol layers.
- Fault-tolerant collectives isolate faults across replicas.
- In-step fine-grained recovery preserves intra-iteration progress (sketched after this list).
- The versatile-workload policy dynamically redistributes microbatch quotas, as in the redistribution sketch above.
- The design is parallelism-agnostic.
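The sketch below illustrates the in-step recovery idea referenced above. It is a toy model under assumed names (`HardwareFault`, `compute_grad`, and `run_step` are all hypothetical), not ReCoVer's implementation: gradients from microbatches finished before a fault are retained, so a retry only redoes the unfinished work within the same iteration.

```python
import random

class HardwareFault(Exception):
    """Stand-in for a hardware failure surfacing mid-iteration."""

def compute_grad(mb, fail_prob=0.0):
    # Stand-in for a real microbatch gradient computation.
    if random.random() < fail_prob:
        raise HardwareFault(f"fault while processing microbatch {mb}")
    return mb * 0.1

def run_step(microbatches, done=None, fail_prob=0.0):
    """Process microbatches, skipping any whose gradients already exist."""
    done = {} if done is None else done
    for i, mb in enumerate(microbatches):
        if i in done:
            continue  # intra-iteration progress preserved across the fault
        done[i] = compute_grad(mb, fail_prob)
    return done

microbatches = list(range(8))
partial = {}
try:
    run_step(microbatches, partial, fail_prob=0.3)
except HardwareFault:
    pass  # gradients accumulated in `partial` survive the fault
grads = run_step(microbatches, partial)  # retry redoes only unfinished work
assert len(grads) == len(microbatches)   # microbatch count per iteration unchanged
```

Compared with restarting the whole step, resuming from the surviving partial state is what makes the recovery fine-grained: the iteration still covers every microbatch exactly once.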