CCL-D: High-Precision Diagnostic System for Slow/Hang Anomalies in Large-Scale Model Training

ai-technology · 2026-05-07

CCL-D is a diagnostic system for slowdowns and hangs in large-scale distributed model training, anomalies that are common and often the most challenging and time-consuming to diagnose. Conventional diagnostic approaches are both inaccurate and slow, and root cause analysis can take hours or even days. CCL-D pairs a rank-level real-time probe with an intelligent decision analyzer. The probe uses a lightweight distributed tracing framework to track communication traffic and measure cross-layer anomaly metrics. The analyzer automates anomaly detection and pinpoints the faulty GPU rank. The system was deployed on a four-GPU setup, where it detected and located anomalies rapidly. The work is available on arXiv as 2605.04478v1.
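To make the idea of rank-level localization concrete, here is a minimal sketch in Python. It is not CCL-D's algorithm: the summary does not specify the cross-layer metrics or decision rules, so the code assumes a hypothetical probe that reports, for each rank, the sequence number of the last collective it entered and the time it entered the current one. The function names and the threshold parameter are illustrative only.

```python
from statistics import median
from typing import Dict, List, Optional


def locate_hang(last_seq: Dict[int, int]) -> List[int]:
    """Return ranks that have not reached the newest collective.

    Assumption: every rank issues the same ordered sequence of collectives,
    so a rank whose last entered sequence number lags the maximum is the
    one the other ranks are waiting for.
    """
    newest = max(last_seq.values())
    return sorted(rank for rank, seq in last_seq.items() if seq < newest)


def locate_slow(enter_ts: Dict[int, float], threshold_s: float) -> Optional[int]:
    """Return the rank that entered a collective abnormally late, if any.

    Assumption: in a synchronizing collective all ranks wait for the last
    arrival, so the latest rank is the suspect when its delay relative to
    the median entry time exceeds a hypothetical threshold (in seconds).
    """
    typical = median(enter_ts.values())
    rank, ts = max(enter_ts.items(), key=lambda item: item[1])
    return rank if ts - typical > threshold_s else None


if __name__ == "__main__":
    # Four ranks; rank 2 is stuck one collective behind the others.
    print(locate_hang({0: 10, 1: 10, 2: 9, 3: 10}))
    # Four ranks; rank 2 enters the collective much later than the rest.
    print(locate_slow({0: 1.00, 1: 1.01, 2: 1.80, 3: 1.02}, threshold_s=0.5))
```

Comparing ranks against each other, rather than against a fixed timeout, is one plausible way to turn per-rank trace data into a single suspect rank. A real system would need to handle clock skew between hosts and asynchronous collectives, which this sketch ignores.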

Key facts

CCL-D is a high-precision diagnostic system for slow/hang anomalies in large-scale distributed training.
Traditional diagnostic methods are inaccurate and inefficient, requiring hours or days for root cause analysis.
CCL-D integrates a rank-level real-time probe with an intelligent decision analyzer.
The probe measures cross-layer anomaly metrics using a lightweight distributed tracing framework.
The analyzer performs automated anomaly detection and root-cause location.
The system precisely identifies the faulty GPU rank.
CCL-D was deployed on a 4-GPU setup.
The research was published on arXiv with ID 2605.04478v1.
