OpenAI Releases MRC Protocol to Stop Network Failures From Stalling GPU Clusters

OpenAI

May 7, 2026

OpenAI released Multipath Reliable Connection (MRC) as an open networking protocol to prevent single link failures from crashing massive AI training jobs. By spraying data across hundreds of paths and using static source routing, the protocol ensures frontier model training remains efficient even as clusters scale past 100,000 GPUs.

OpenAI released the specification for Multipath Reliable Connection (MRC), a networking protocol developed with AMD, Broadcom, Intel, Microsoft, and NVIDIA. It extends RDMA over Converged Ethernet (RoCE), a standard for high-speed data transfer between GPUs over Ethernet, and is now available through the Open Compute Project (OCP).

Availability: Open Compute Project
Supported Hardware: NVIDIA GB200, Broadcom, and others
Network Speed: 800Gb/s interfaces
Cluster Scale: 131,000 GPUs with two switch tiers
Routing Protocol: SRv6 Source Routing

Traditional networking acts as a "failure amplifier" in synchronous AI training: if one packet is delayed, thousands of GPUs sit idle. MRC shifts from complex dynamic routing to a deterministic "multi-plane" design that reduces switch tiers. This allows the network to route around failures in microseconds, maintaining momentum for frontier models like GPT-5.5.
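The link between a multi-plane design and fewer switch tiers can be checked with standard leaf-spine arithmetic. The sketch below is illustrative only: the 8-way split of an 800Gb/s interface into 100Gb/s planes and the 51.2Tb/s switch capacity are assumptions, not values taken from the MRC specification. A non-blocking two-tier fabric built from switches with k ports connects at most k*k/2 endpoints, so narrower ports (a higher port count per switch) raise the ceiling sharply.

```python
def two_tier_endpoints(radix: int) -> int:
    """Max endpoints in a non-blocking two-tier leaf-spine fabric.

    Each leaf uses half its ports for hosts and half for uplinks;
    each spine has `radix` ports, so it can reach at most `radix` leaves.
    """
    hosts_per_leaf = radix // 2
    max_leaves = radix
    return hosts_per_leaf * max_leaves


# Hypothetical hardware numbers (assumptions for illustration).
NIC_GBPS = 800          # per-GPU interface speed quoted in the article
PLANES = 8              # assumed number of independent network planes
SWITCH_GBPS = 51_200    # assumed switch capacity (51.2 Tb/s)

single_plane_radix = SWITCH_GBPS // NIC_GBPS             # ports at 800 Gb/s
multi_plane_radix = SWITCH_GBPS // (NIC_GBPS // PLANES)  # ports at 100 Gb/s

print(two_tier_endpoints(single_plane_radix))  # endpoints with one 800G plane
print(two_tier_endpoints(multi_plane_radix))   # endpoints per 100G plane
```

Under these assumptions, a single 800Gb/s plane tops out at 2,048 endpoints in two tiers, while each 100Gb/s plane reaches 131,072, which lines up with the cluster scale quoted above.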

You can now access the MRC 1.0 specification through the OCP. The protocol is aimed at organizations managing massive GPU clusters, but backing from major hardware vendors suggests that future AI networking hardware will be more resilient to link failures. This release follows other infrastructure optimizations like OpenAI's WebSocket-based Responses API.

View the full update on openai.com

OpenAI (@OpenAI), May 6

We’ve partnered with @AMD, @Broadcom, @Intel, @Microsoft, and @NVIDIA, to release Multipath Reliable Connection (MRC), a new open networking protocol that helps large AI training clusters run faster and more reliably, with less wasted GPU time. https://t.co/AiV952AJXs


Still wondering? A few quick answers below.

What is OpenAI MRC?

Multipath Reliable Connection is a new networking protocol designed for large-scale AI supercomputers. Developed by OpenAI alongside major hardware partners, it improves the reliability and speed of data transfers between GPUs. It specifically addresses the failure amplifier problem where a single network hiccup can stall an entire synchronous AI training job.
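A back-of-the-envelope model shows why one hiccup is so costly. In the minimal sketch below, a synchronous step ends only when the slowest transfer finishes, and every GPU waits out the stall. The function names and all numbers are hypothetical and chosen for illustration; they are not measurements from OpenAI.

```python
def step_time(transfer_times_s: list[float], compute_s: float) -> float:
    """A synchronous step finishes only when the slowest transfer completes."""
    return compute_s + max(transfer_times_s)


def idle_gpu_seconds(num_gpus: int, stall_us: int) -> float:
    """GPU time wasted when every GPU waits out one stalled transfer."""
    return num_gpus * stall_us / 1e6


# Three healthy transfers and one delayed transfer: the delayed one sets the pace.
healthy = step_time([0.25, 0.25, 0.25, 0.25], compute_s=1.0)
delayed = step_time([0.25, 0.25, 0.25, 0.5], compute_s=1.0)

# Hypothetical cluster of 100,000 GPUs: a one-second stall versus a 10 µs stall.
slow_recovery = idle_gpu_seconds(100_000, stall_us=1_000_000)
fast_recovery = idle_gpu_seconds(100_000, stall_us=10)
print(healthy, delayed, slow_recovery, fast_recovery)
```

In this toy model, a one-second stall costs 100,000 GPU-seconds, while a 10-microsecond stall costs about one.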

How does the MRC protocol work?

MRC uses a multi-plane network design that reduces the number of switch tiers needed to connect thousands of GPUs. It sprays data packets across hundreds of different paths simultaneously rather than using a single path. This approach, combined with static source routing, allows the network to detect and bypass failed links in microseconds without recomputing routes.
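The idea of spraying over precomputed routes and skipping failed ones can be sketched in a few lines. This is a toy model, not the MRC wire format or algorithm: the `Sprayer` class, the hop names, and the round-robin policy are all assumptions made for illustration. The point it shows is that a failure only marks a route as unusable; nothing is recomputed.

```python
from dataclasses import dataclass, field


@dataclass
class Sprayer:
    """Sprays packets over fixed, precomputed source routes (hypothetical model)."""
    routes: list[tuple[str, ...]]                 # each route is a static list of hops
    failed: set[int] = field(default_factory=set)
    _next: int = 0                                # round-robin cursor

    def mark_failed(self, route_id: int) -> None:
        """Record a failed route; the route table itself is left untouched."""
        self.failed.add(route_id)

    def send(self, payload: bytes) -> tuple[int, tuple[str, ...], bytes]:
        """Attach the next healthy route to the packet, round-robin."""
        for _ in range(len(self.routes)):
            route_id = self._next
            self._next = (self._next + 1) % len(self.routes)
            if route_id not in self.failed:
                return route_id, self.routes[route_id], payload
        raise RuntimeError("no healthy routes left")


# Four hypothetical routes between two leaf switches, one per spine.
routes = [("leaf0", f"spine{i}", "leaf7") for i in range(4)]
sprayer = Sprayer(routes)
sprayer.mark_failed(1)  # pretend the link through spine1 went down
used = [sprayer.send(b"chunk")[0] for _ in range(6)]
print(used)  # route 1 never appears
```

Because each packet carries its full route, the sender alone decides how to avoid a bad link, which is why no routing protocol convergence is needed in this model.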

Is OpenAI MRC open source?

The MRC specification has been released as an open standard through the Open Compute Project. This allows the broader industry to use, build upon, and integrate the protocol into their own networking hardware and software. It is not a proprietary tool but a shared infrastructure standard intended to help scale AI systems across the entire ecosystem.

Why is MRC better than traditional networking for AI?

Traditional protocols like BGP can take seconds to route around failures, which is too slow for synchronous AI training. MRC handles congestion and link failures on a microsecond timescale. By using adaptive packet spraying and source routing, it eliminates core congestion and ensures that training jobs continue moving even when individual network components fail.
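One simple way to picture adaptive spraying is a sender that steers each packet toward the path with the least unacknowledged data. The sketch below assumes that policy purely for illustration; the source does not describe which congestion signals MRC actually uses, and the `AdaptiveSprayer` class is hypothetical.

```python
class AdaptiveSprayer:
    """Picks the path with the fewest bytes in flight (assumed policy)."""

    def __init__(self, num_paths: int) -> None:
        self.in_flight = [0] * num_paths  # bytes sent but not yet acknowledged

    def pick_path(self, size: int) -> int:
        """Choose the least-loaded path and charge the packet to it."""
        path = min(range(len(self.in_flight)), key=self.in_flight.__getitem__)
        self.in_flight[path] += size
        return path

    def on_ack(self, path: int, size: int) -> None:
        """An acknowledgement frees capacity on the path it arrived from."""
        self.in_flight[path] = max(0, self.in_flight[path] - size)


sprayer = AdaptiveSprayer(num_paths=3)
first = [sprayer.pick_path(1000) for _ in range(3)]  # one packet per path

# Paths 0 and 1 acknowledge promptly; path 2 is congested and stays silent.
sprayer.on_ack(0, 1000)
sprayer.on_ack(1, 1000)
after = [sprayer.pick_path(1000) for _ in range(2)]
print(first, after)  # new packets go to paths 0 and 1, not the congested path 2
```

A slow path accumulates unacknowledged bytes and is picked less often, so traffic drains away from congestion without any switch-side route changes.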

Who can use the MRC protocol?

The protocol is available to any organization building or operating large-scale AI training clusters. It is already deployed in OpenAI's largest supercomputers using NVIDIA GB200 hardware at Microsoft and Oracle Cloud Infrastructure sites. Because it is an open OCP contribution, hardware vendors and cloud providers can now implement MRC in their own networking stacks.


