ARTFEED — Contemporary Art Intelligence

OpenAI's New WebRTC Architecture for Low-Latency Voice AI

ai-technology · 2026-05-04

OpenAI engineers Yi Zhang and William McDonald detail a rearchitected WebRTC stack for real-time voice AI, addressing scalability challenges for over 900 million weekly active users. The system uses a split relay-plus-transceiver model: a lightweight relay layer handles packet routing via ICE username fragments, while transceivers terminate WebRTC sessions. This design reduces the public UDP footprint to a small number of ports, enabling deployment on Kubernetes without exposing large port ranges. Global Relay ingress points shorten first-hop latency, and geo-steered signaling directs clients to nearby clusters. The relay, written in Go, uses SO_REUSEPORT and thread pinning for efficiency. Key results include lower latency, reduced jitter, and simplified infrastructure scaling. The architecture preserves standard WebRTC behavior for clients, ensuring interoperability with browsers and mobile apps. OpenAI's approach avoids kernel bypass, instead relying on a narrow, carefully tuned implementation that handles global real-time media traffic with a small relay footprint.

Key facts

  • OpenAI serves over 900 million weekly active users with voice AI.
  • The new architecture uses a split relay-plus-transceiver model.
  • The relay routes packets using ICE username fragments (ufrags).
  • Transceivers terminate WebRTC sessions and own protocol state.
  • Global Relay fleet provides geographically distributed ingress points.
  • Geo-steered signaling directs clients to nearby transceiver clusters.
  • The relay is written in Go and uses SO_REUSEPORT and thread pinning.
  • The design reduces public UDP footprint to a small number of ports.

Entities

Institutions

  • OpenAI
  • Cloudflare
  • Pion

Sources