TrainMover: Resilient Runtime Reduces LLM Training Downtime

ai-technology · 2026-05-18

TrainMover is a new runtime that addresses the interruptions common in large-scale ML training jobs: software and hardware anomalies, failures, and management events. Existing approaches such as checkpoint-restart and runtime reconfiguration often incur long downtimes and degraded performance. TrainMover instead uses elastic and standby machines to keep interruptions short, achieving negligible downtime with no memory overhead. The system relies on three techniques: a two-phase, delta-based communication group setup; a communication-free sandboxed warmup; and a general standby design that supports recovery for any role. In evaluation at the 1024-GPU scale, TrainMover kept downtime to approximately 20 seconds across a variety of interruptions. It is projected to cut wasted GPU hours by 55%, which translates to savings of 1.4 million GPU-hours per week at the 64K-GPU scale.
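To put the quoted figures in context, the short sketch below works through the arithmetic. It is a back-of-envelope check, not the authors' projection model. It assumes that "64K" means 65,536 GPUs, that a week is 168 hours, and that the 1.4 million GPU-hours saved is the same quantity as the 55% reduction in wasted hours. The function names are illustrative only.

```python
# Back-of-envelope arithmetic on the figures quoted in the summary.
# Assumptions (not stated in the source): "64K" = 65,536 GPUs, a week has
# 168 hours, and the 1.4M GPU-hours saved equals 55% of the baseline waste.

GPUS_64K = 64 * 1024
HOURS_PER_WEEK = 7 * 24


def weekly_gpu_hours(gpus: int) -> int:
    """Total GPU-hours available in one week for a cluster of `gpus` GPUs."""
    return gpus * HOURS_PER_WEEK


def implied_baseline_waste(saved_hours: float, reduction: float) -> float:
    """Baseline wasted GPU-hours implied by a saving and a fractional reduction."""
    return saved_hours / reduction


def downtime_gpu_hours(gpus: int, downtime_seconds: float) -> float:
    """GPU-hours idled when `gpus` GPUs all wait for `downtime_seconds`."""
    return gpus * downtime_seconds / 3600


if __name__ == "__main__":
    total = weekly_gpu_hours(GPUS_64K)
    baseline = implied_baseline_waste(1.4e6, 0.55)
    print(f"Weekly capacity at 64K GPUs: {total:,} GPU-hours")
    print(f"Implied baseline waste:      {baseline:,.0f} GPU-hours/week")
    print(f"Saving as share of capacity: {1.4e6 / total:.1%}")
    print(f"Waste as share of capacity:  {baseline / total:.1%}")
    print(f"One 20 s stall at 1024 GPUs: {downtime_gpu_hours(1024, 20):.2f} GPU-hours")
```

Under these assumptions, the two quoted numbers together imply a baseline of roughly 2.5 million wasted GPU-hours per week, or a little under a quarter of the cluster's weekly capacity. The source may define the projection differently, so treat this as a consistency check rather than a restatement of its result.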

Key facts

TrainMover is a resilient LLM training runtime.
It handles interruptions from hardware/software anomalies, failures, and management events.
Existing solutions like checkpoint-restart suffer long downtimes.
TrainMover uses elastic and standby machines.
It introduces three key techniques: two-phase delta-based communication group setup, communication-free sandboxed warmup, and general standby design.
Evaluation at the 1024-GPU scale shows about 20 seconds of downtime.
It is projected to reduce wasted GPU hours by 55%.
That could save 1.4 million GPU-hours per week at the 64K-GPU scale.

Sources

arXiv cs.AI — 2026-05-18