HeadsUpAI

OpenAI Releases MRC Protocol to Stop Network Failures From Stalling GPU Clusters

OpenAI released the specification for Multipath Reliable Connection (MRC), a networking protocol developed with AMD, Broadcom, Intel, Microsoft, and NVIDIA. It extends RDMA over Converged Ethernet (RoCE) (a standard for high-speed data transfer between GPUs) and is now available through the Open Compute Project (OCP).
Availability
Open Compute Project
Supported Hardware
NVIDIA GB200, Broadcom, and others
Network Speed
800Gb/s interfaces
Cluster Scale
131,000 GPUs with two switch tiers
Routing Protocol
SRv6 Source Routing

Traditional networking acts as a "failure amplifier" in synchronous AI training: if one packet is delayed, thousands of GPUs sit idle. MRC shifts from complex dynamic routing to a deterministic "multi-plane" design that reduces switch tiers. This allows the network to route around failures in microseconds, maintaining momentum for frontier models like GPT-5.5.

You can now access the MRC 1.0 specification through the OCP to optimize large-scale AI infrastructure. While aimed at organizations managing massive GPU clusters, adoption across major hardware vendors ensures future AI-native networking will be more resilient. This release follows other infrastructure optimizations like OpenAI's WebSocket-based Responses API.

OpenAI
OpenAI
@OpenAI
X

We’ve partnered with @AMD, @Broadcom, @Intel, @Microsoft, and @NVIDIA, to release Multipath Reliable Connection (MRC), a new open networking protocol that helps large AI training clusters run faster and more reliably, with less wasted GPU time. https://t.co/AiV952AJXs

704retweets6.1klikes
View on X

Still wondering? A few quick answers below.

Multipath Reliable Connection is a new networking protocol designed for large-scale AI supercomputers. Developed by OpenAI alongside major hardware partners, it improves the reliability and speed of data transfers between GPUs. It specifically addresses the failure amplifier problem where a single network hiccup can stall an entire synchronous AI training job.

MRC uses a multi-plane network design that reduces the number of switch tiers needed to connect thousands of GPUs. It sprays data packets across hundreds of different paths simultaneously rather than using a single path. This approach, combined with static source routing, allows the network to detect and bypass failed links in microseconds without recomputing routes.

The MRC specification has been released as an open standard through the Open Compute Project. This allows the broader industry to use, build upon, and integrate the protocol into their own networking hardware and software. It is not a proprietary tool but a shared infrastructure standard intended to help scale AI systems across the entire ecosystem.

Traditional protocols like BGP can take seconds to route around failures, which is too slow for synchronous AI training. MRC handles congestion and link failures on a microsecond timescale. By using adaptive packet spraying and source routing, it eliminates core congestion and ensures that training jobs continue moving even when individual network components fail.

The protocol is available to any organization building or operating large-scale AI training clusters. It is already deployed in OpenAI's largest supercomputers using NVIDIA GB200 hardware at Microsoft and Oracle Cloud Infrastructure sites. Because it is an open OCP contribution, hardware vendors and cloud providers can now implement MRC in their own networking stacks.

Share this update