ARTFEED — Contemporary Art Intelligence

ReCoVer: Fault-Tolerant System for LLM Pre-Training

other · 2026-05-13

A new system called ReCoVer addresses hardware faults during large language model pre-training on GPU clusters. It maintains a constant microbatch count per iteration so that gradients remain equivalent to those of a failure-free run. The framework consists of three decoupled protocol layers: fault-tolerant collectives, in-step fine-grained recovery, and a versatile-workload policy for dynamic microbatch redistribution. The design is parallelism-agnostic and integrates directly with existing training frameworks.

Key facts

  • ReCoVer is a resilient LLM pre-training system.
  • It keeps the number of microbatches constant per iteration.
  • It ensures per-iteration gradients are stochastically equivalent to failure-free runs.
  • The framework has three decoupled protocol layers.
  • Fault-tolerant collectives isolate faults across replicas.
  • In-step fine-grained recovery preserves intra-iteration progress.
  • Versatile-workload policy dynamically redistributes microbatch quotas.
  • The design is parallelism-agnostic.
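The core invariant above can be illustrated with a minimal sketch: when a replica fails, its microbatch quota is spread across the survivors so the per-iteration total stays fixed. All names and structure here are hypothetical illustrations of the idea, not ReCoVer's actual API.

```python
# Hypothetical sketch of the versatile-workload idea: redistribute a fixed
# per-iteration microbatch budget across the healthy replicas, so the
# aggregated gradient matches a failure-free run of the same iteration.

def redistribute_quotas(total_microbatches: int, healthy_replicas: list[int]) -> dict[int, int]:
    """Split a fixed microbatch budget across healthy replicas.

    The returned quotas always sum to total_microbatches, keeping the
    per-iteration microbatch count constant after a failure.
    """
    if not healthy_replicas:
        raise RuntimeError("no healthy replicas left to absorb the workload")
    n = len(healthy_replicas)
    base, extra = divmod(total_microbatches, n)
    # Replicas earlier in the list each absorb one extra microbatch
    # until the remainder is exhausted.
    return {
        rank: base + (1 if i < extra else 0)
        for i, rank in enumerate(healthy_replicas)
    }

# Example: 8 replicas each run 4 microbatches (32 total); replica 3 fails.
quotas = redistribute_quotas(32, [r for r in range(8) if r != 3])
assert sum(quotas.values()) == 32  # per-iteration count unchanged
```

Because the total microbatch count (and thus the set of samples consumed per iteration) is unchanged, the optimizer sees the same gradient it would have seen without the failure, which is the property the bullet points describe.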
