OpenAI and Microsoft deploy MRC and SRv6 for resilient AI supercomputer networking
A new RDMA-based transport protocol called MRC, combined with multi-plane Clos topologies and static SRv6 source routing, has been deployed in production across OpenAI's and Microsoft's largest training clusters. The approach eliminates flow collisions by spraying traffic across many paths and actively load-balancing between them, lets clusters of more than 100K GPUs be built as two-tier topologies with increased redundancy, and automatically routes around network failures. MRC has been used to train the latest frontier models, allowing jobs to ride out failures that previously would have interrupted training.
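To illustrate why spraying removes flow collisions, here is a minimal Python sketch. It is not MRC's actual algorithm, which the summary does not describe. It assumes equal-cost paths, pins each flow to one random path as a stand-in for per-flow hashing, and models spraying as a simple round-robin over all paths. The function names, flow sizes, and path count are hypothetical.

```python
import random

def max_link_load_hashed(flow_sizes, num_paths, rng):
    """Pin each flow to a single random path (stand-in for per-flow hashing)."""
    load = [0] * num_paths
    for size in flow_sizes:
        load[rng.randrange(num_paths)] += size
    return max(load)

def max_link_load_sprayed(flow_sizes, num_paths):
    """Spread each flow's packets round-robin across every path."""
    load = [0] * num_paths
    for i, size in enumerate(flow_sizes):
        base, extra = divmod(size, num_paths)
        # Rotate the starting path per flow so leftover packets do not pile up.
        for k in range(num_paths):
            load[(i + k) % num_paths] += base + (1 if k < extra else 0)
    return max(load)

# Illustrative workload (assumed): 16 equal flows of 1024 packets over 16 paths.
flows = [1024] * 16
paths = 16
hashed = max_link_load_hashed(flows, paths, random.Random(0))
sprayed = max_link_load_sprayed(flows, paths)
print("busiest path, one path per flow:", hashed)
print("busiest path, sprayed:", sprayed)
```

With spraying, every path carries exactly the average load (1024 packets here). With one path per flow, the busiest path can never carry less than that, and it carries a multiple of it whenever two flows land on the same path. In a synchronous job, that busiest path determines when the step finishes.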
Key facts
- MRC is a new RDMA-based transport protocol
- MRC sprays across many paths and actively load-balances between them
- Multi-plane Clos topologies allow clusters of over 100K GPUs to be built as two-tier networks
- Static source routing using SRv6 allows MRC to bypass failures
- Deployed in production in OpenAI's and Microsoft's largest training clusters
- Used to train the latest frontier models
- Allows AI training jobs to ride out many network failures
- Tail latency dominates performance of synchronous pretraining at very large scales
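The two-tier scaling fact above can be checked with the standard leaf-spine capacity formula. This is a sketch under stated assumptions: the 51.2 Tb/s switch capacity and the 800G and 100G port speeds are illustrative values, not figures from the source. The idea is that splitting each NIC's bandwidth into several slower ports, one per plane, lets the same switch expose many more ports, and two-tier capacity grows with the square of the port count.

```python
def two_tier_endpoints(radix):
    """Max endpoints in one non-blocking two-tier (leaf-spine) Clos plane.

    Each leaf uses radix/2 ports toward endpoints and radix/2 toward spines.
    Each spine port connects to one leaf, so there are at most `radix` leaves.
    """
    return radix * (radix // 2)

# Hypothetical switch: 51.2 Tb/s of capacity, carved into ports two ways.
SWITCH_GBPS = 51_200

for port_gbps in (800, 100):
    radix = SWITCH_GBPS // port_gbps
    print(f"{port_gbps}G ports: radix {radix}, "
          f"{two_tier_endpoints(radix)} endpoints per plane")
# 800G ports: radix 64, 2048 endpoints per plane
# 100G ports: radix 512, 131072 endpoints per plane
```

Under these assumptions, eight 100G planes give each NIC the same 800G aggregate bandwidth while raising the two-tier limit from 2,048 to 131,072 endpoints. Losing one plane also removes only a fraction of a NIC's bandwidth rather than disconnecting it.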
Entities
Institutions
- OpenAI
- Microsoft