HeadsUpAI

OpenAI Rebuilds WebRTC Stack to Scale Low Latency Voice AI

OpenAI rearchitected its WebRTC (a protocol for real-time audio and video) stack to power ChatGPT voice and the Realtime API. The design splits the workload between a stateless relay and a stateful transceiver. This replaces the one-port-per-session model, which fits poorly within Kubernetes (a system for managing containerized software) environments.

A conventional one-port-per-session WebRTC deployment needs thousands of open UDP ports at scale, which widens the attack surface. With its "Global Relay" network, OpenAI shortens the distance between users and servers to minimize jitter. The shift mirrors OpenAI's WebSocket-based Responses API and its open-source voice component, both cases where delays break conversational flow.

The architecture ensures audio arrives as a continuous stream, allowing models to reason while a user is still talking. This design is optimized for 1:1 interactions typical of AI agents. You can access these low-latency capabilities through the Realtime API or the ChatGPT mobile and web applications.

OpenAI Developers (@OpenAIDevs) on X:

🎙️ Voice AI only feels natural when conversation keeps pace with speech. Here’s how we rebuilt our WebRTC stack with a thin relay and stateful transceiver to keep real-time media fast for ChatGPT voice, the Realtime API, and more. https://t.co/JEvs2PmsmC

Still wondering? A few quick answers below.

OpenAI rearchitected its stack to handle massive scale while maintaining low latency for 900 million weekly users. The previous model assigned one port per session, so it needed thousands of open UDP ports at scale, which was difficult to secure and to run on Kubernetes. The new design uses a split relay and transceiver model to provide a smaller, fixed network footprint.

The relay acts as a lightweight forwarding layer that routes packets without decrypting them. It uses the ICE username fragment, a protocol-native identifier, to steer traffic to the correct destination. The transceiver remains the stateful endpoint that handles encryption, session lifecycle, and media processing, allowing backend services to scale like standard web applications.
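As a rough illustration of routing on the ICE username fragment, the sketch below reads the USERNAME attribute from a STUN binding request without touching any encrypted payload. The STUN header and attribute layout follow the public STUN specification, and in ICE connectivity checks the USERNAME value has the form "receiver ufrag:sender ufrag". Everything else is an assumption made for illustration: the function names, the `ROUTES` table, and the addresses are hypothetical, and nothing here is taken from OpenAI's implementation.

```python
import struct
from typing import Dict, Optional, Tuple

STUN_MAGIC_COOKIE = 0x2112A442   # fixed value in every STUN header
STUN_ATTR_USERNAME = 0x0006      # USERNAME attribute type
STUN_HEADER_LEN = 20             # type(2) + length(2) + cookie(4) + txn id(12)


def extract_server_ufrag(packet: bytes) -> Optional[str]:
    """Return the receiver-side ufrag from a STUN packet, or None."""
    if len(packet) < STUN_HEADER_LEN:
        return None
    msg_type, msg_len, cookie = struct.unpack_from("!HHI", packet, 0)
    # The top two bits of a STUN message type are always zero.
    if msg_type & 0xC000 or cookie != STUN_MAGIC_COOKIE:
        return None
    end = min(len(packet), STUN_HEADER_LEN + msg_len)
    offset = STUN_HEADER_LEN
    while offset + 4 <= end:
        attr_type, attr_len = struct.unpack_from("!HH", packet, offset)
        value_start = offset + 4
        if value_start + attr_len > end:
            return None  # truncated attribute
        if attr_type == STUN_ATTR_USERNAME:
            try:
                username = packet[value_start:value_start + attr_len].decode("utf-8")
            except UnicodeDecodeError:
                return None
            # "receiver-ufrag:sender-ufrag" -> keep the receiver's part.
            return username.split(":", 1)[0]
        # Attribute values are padded to a 4-byte boundary.
        offset = value_start + ((attr_len + 3) & ~3)
    return None


def build_binding_request(username: str) -> bytes:
    """Helper for the demo: a minimal binding request with only USERNAME."""
    value = username.encode("utf-8")
    padded = value + b"\x00" * (-len(value) % 4)
    attr = struct.pack("!HH", STUN_ATTR_USERNAME, len(value)) + padded
    header = struct.pack("!HHI", 0x0001, len(attr), STUN_MAGIC_COOKIE) + bytes(12)
    return header + attr


# Hypothetical routing table: server ufrag -> transceiver address.
ROUTES: Dict[str, Tuple[str, int]] = {"srvA1": ("10.0.0.7", 5000)}


def route(packet: bytes) -> Optional[Tuple[str, int]]:
    """Pick a destination from the ufrag alone; the payload stays opaque."""
    ufrag = extract_server_ufrag(packet)
    return ROUTES.get(ufrag) if ufrag is not None else None
```

Calling `route(build_binding_request("srvA1:cli9"))` looks up the entry for `srvA1`. The point of the design is that the relay needs only this plaintext identifier, so encryption keys and session state can stay on the transceiver.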

Global Relay is a geographically distributed network of ingress points designed to shorten the distance between a user and OpenAI's infrastructure. By accepting packets at a relay close to the user, the system reduces network jitter and round-trip time. This ensures that conversational turn-taking in ChatGPT voice and the Realtime API feels natural and responsive.

OpenAI avoids the traditional one-port-per-session model, which is brittle for autoscaling and difficult to load balance. Instead, it uses a small number of stable UDP ports and a custom routing layer. This allows pods to be added or rescheduled without breaking active sessions, because the relay can deterministically recover routes using metadata already present in the connection setup.
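One way to picture deterministic route recovery is a relay whose per-client route table is only a cache, rebuilt from the username fragment whenever a STUN packet arrives. The sketch below assumes a hypothetical scheme in which the first four characters of the server-side ufrag name the transceiver. That encoding, the `TRANSCEIVERS` registry, the class and method names, and all addresses are illustrative assumptions, not details from OpenAI's design.

```python
from typing import Dict, Optional, Tuple

Addr = Tuple[str, int]

# Hypothetical registry: transceiver id -> internal address.
TRANSCEIVERS: Dict[str, Addr] = {
    "tx17": ("10.0.1.17", 6000),
    "tx42": ("10.0.1.42", 6000),
}

# Assumed layout: a fixed-length id prefix followed by random characters.
# A prefix is used rather than a separator because ICE ufrags may contain
# only letters, digits, '+' and '/'.
ID_PREFIX_LEN = 4


def transceiver_for_ufrag(ufrag: str) -> Optional[Addr]:
    """Derive the destination from the ufrag alone, with no stored state."""
    return TRANSCEIVERS.get(ufrag[:ID_PREFIX_LEN])


class Relay:
    """Forwards packets; its flow table is a cache that can be lost safely."""

    def __init__(self) -> None:
        self.flows: Dict[Addr, Addr] = {}  # client address -> transceiver

    def on_stun(self, client: Addr, server_ufrag: str) -> Optional[Addr]:
        # STUN packets carry the ufrag, so the route can always be re-derived.
        dest = transceiver_for_ufrag(server_ufrag)
        if dest is not None:
            self.flows[client] = dest
        return dest

    def on_media(self, client: Addr) -> Optional[Addr]:
        # Encrypted media carries no ufrag; rely on the cached flow.
        return self.flows.get(client)
```

In this sketch a freshly started relay has an empty cache, so `on_media` returns `None` until the next STUN packet from that client arrives and `on_stun` restores the same route as before. ICE agents typically keep sending STUN checks during a session, which is what would make this kind of recovery practical.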

This rearchitected media stack currently powers real-time voice interactions for ChatGPT and the Realtime API. It is also used for internal research projects and interactive AI agents that require continuous audio streaming. The design allows these systems to process audio, perform transcription, and generate speech while a user is still talking, rather than waiting for a full upload.
