CCL-D: High-Precision Diagnostic System for Slow/Hang Anomalies in Large-Scale Model Training
CCL-D is a diagnostic tool for slowdowns and hangs in large-scale distributed model training, anomalies that are among the most difficult and time-consuming to diagnose. Conventional diagnostic approaches are both inaccurate and slow, sometimes requiring hours or even days for root cause analysis. CCL-D combines a rank-level real-time probe with an intelligent decision analyzer. The probe uses a lightweight distributed tracing framework to measure cross-layer anomaly metrics by tracking communication traffic. The analyzer automates anomaly detection and pinpoints the faulty GPU rank. The system was tested on a setup with four GPUs, where it detected and localized anomalies rapidly. The work is available on arXiv under ID 2605.04478v1.
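To make the probe/analyzer split concrete, the following is a minimal Python sketch of how per-rank communication traces could be turned into a faulty-rank verdict. It is an illustration only, not CCL-D's actual algorithm, which this summary does not describe. The names `RankTrace`, `find_hang_suspect`, `find_slow_suspect`, and the 50 ms lateness threshold are all hypothetical. The sketch assumes each rank's probe records the time at which it entered each collective operation, in order.

```python
from dataclasses import dataclass
from statistics import median
from typing import List, Optional


@dataclass
class RankTrace:
    """Hypothetical probe output for one rank."""
    rank: int
    # Entry timestamp (seconds) of each collective, in issue order.
    enter_times: List[float]


def find_hang_suspect(traces: List[RankTrace]) -> Optional[int]:
    """Return the single rank that entered fewer collectives than the rest.

    In a hang, healthy ranks block inside a collective that the faulty
    rank never entered, so the faulty rank has the shortest trace.
    """
    counts = {t.rank: len(t.enter_times) for t in traces}
    lowest, highest = min(counts.values()), max(counts.values())
    if lowest == highest:
        return None  # every rank reached the same collective
    laggards = [r for r, c in counts.items() if c == lowest]
    return laggards[0] if len(laggards) == 1 else None


def find_slow_suspect(traces: List[RankTrace],
                      threshold_s: float = 0.05) -> Optional[int]:
    """Return the rank that is consistently last to arrive at collectives.

    Lateness is measured against the first rank to arrive; a rank is
    flagged only if its median lateness exceeds threshold_s (assumed value).
    """
    n = min(len(t.enter_times) for t in traces)
    if n == 0:
        return None
    lateness = {t.rank: [] for t in traces}
    for i in range(n):
        first = min(t.enter_times[i] for t in traces)
        for t in traces:
            lateness[t.rank].append(t.enter_times[i] - first)
    med = {r: median(v) for r, v in lateness.items()}
    worst = max(med, key=med.get)
    return worst if med[worst] > threshold_s else None
```

With four ranks, a rank that arrives roughly 200 ms late at every collective would be returned by `find_slow_suspect`, and a rank whose trace stops one collective short of the others would be returned by `find_hang_suspect`. A production analyzer would also need clock synchronization across hosts and metrics from below the collective layer, which this sketch ignores.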
Key facts
- CCL-D is a high-precision diagnostic system for slow/hang anomalies in large-scale distributed training.
- Traditional diagnostic methods are inaccurate and inefficient, requiring hours or days for root cause analysis.
- CCL-D integrates a rank-level real-time probe with an intelligent decision analyzer.
- The probe measures cross-layer anomaly metrics using a lightweight distributed tracing framework.
- The analyzer performs automated anomaly detection and root-cause localization.
- The system precisely identifies the faulty GPU rank.
- CCL-D was deployed on a 4-GPU setup.
- The research was published on arXiv with ID 2605.04478v1.
Entities
Institutions
- arXiv