OpenAI rebuilt its real-time voice architecture with Redis to handle 900 million weekly users
Justin Uberti spent a decade building WebRTC into Google's Hangouts, Meet, and Duo. Sean DuBois built Pion from scratch — the open-source Go implementation that became the reference standard for WebRTC in that language. Both are now at OpenAI, and the problem they solved there is the same one every company hits when it tries to run real-time voice on cloud-native infrastructure: Kubernetes wants everything to be temporary; voice sessions want the opposite.
The answer, described in detail in an OpenAI tech blog post published Monday, runs through Redis. OpenAI rebuilt its real-time voice architecture from the ground up to serve more than 900 million weekly active users, replacing the one-port-per-session model that doesn't fit hyperscaler infrastructure with a split relay plus transceiver design backed by a Redis cache. The post is the most detailed public accounting yet of how a major lab bridges cloud elasticity and real-time media — authored by the two people who helped define what WebRTC is.
Standard WebRTC deployment requires one public UDP port per session. At OpenAI's scale, that means managing tens of thousands of ports across a fleet that autoscaling constantly redistributes. Cloud load balancers weren't designed for it. Kubernetes services weren't either. The exposed port surface becomes an operational and security burden, and pods can't move freely without breaking the sessions they own.
OpenAI's solution was to split packet routing from protocol termination. Media enters through a lightweight relay that never decrypts anything, never runs an ICE state machine, and never terminates WebRTC. The relay reads just enough from the first packet — specifically the ICE username fragment, or ufrag, a short identifier already present in every WebRTC session setup — to infer which transceiver owns the session and forward accordingly. From the client's perspective, nothing changes. It's still speaking standard WebRTC to a standard WebRTC endpoint.
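The post doesn't publish the relay's parsing code, but the ufrag hint it describes can be read from the first STUN packet using nothing beyond the standard attribute layout defined in RFC 5389. The Go sketch below shows the idea; `ufragFromSTUN` and `makeBindingRequest` are illustrative names, not OpenAI's. In ICE, the USERNAME attribute is "remote-ufrag:local-ufrag", so the part before the colon is enough to identify the receiving session without terminating ICE.

```go
package main

import (
	"encoding/binary"
	"errors"
	"fmt"
)

const (
	stunHeaderLen = 20
	stunMagic     = 0x2112A442 // fixed magic cookie in every STUN message
	attrUsername  = 0x0006     // USERNAME attribute type (RFC 5389)
)

// ufragFromSTUN pulls the ICE ufrag out of a STUN Binding Request's
// USERNAME attribute. The part before the colon identifies the receiving
// session, which is enough for a relay to pick the owning transceiver.
func ufragFromSTUN(pkt []byte) (string, error) {
	if len(pkt) < stunHeaderLen || binary.BigEndian.Uint32(pkt[4:8]) != stunMagic {
		return "", errors.New("not a STUN packet")
	}
	msgLen := int(binary.BigEndian.Uint16(pkt[2:4]))
	if stunHeaderLen+msgLen > len(pkt) {
		return "", errors.New("truncated STUN message")
	}
	// Attributes are TLV-encoded, each value padded to a 4-byte boundary.
	for off := stunHeaderLen; off+4 <= stunHeaderLen+msgLen; {
		attrType := binary.BigEndian.Uint16(pkt[off : off+2])
		attrLen := int(binary.BigEndian.Uint16(pkt[off+2 : off+4]))
		valEnd := off + 4 + attrLen
		if valEnd > stunHeaderLen+msgLen {
			return "", errors.New("malformed attribute")
		}
		if attrType == attrUsername {
			user := string(pkt[off+4 : valEnd])
			for i := 0; i < len(user); i++ {
				if user[i] == ':' {
					return user[:i], nil // routing key: the receiver's ufrag
				}
			}
			return user, nil
		}
		off = valEnd + (4-attrLen%4)%4 // skip value padding
	}
	return "", errors.New("no USERNAME attribute")
}

// makeBindingRequest builds a minimal STUN Binding Request carrying the
// given USERNAME value, for demonstration only.
func makeBindingRequest(user string) []byte {
	padded := (len(user) + 3) / 4 * 4
	attr := make([]byte, 4+padded)
	binary.BigEndian.PutUint16(attr[0:2], attrUsername)
	binary.BigEndian.PutUint16(attr[2:4], uint16(len(user)))
	copy(attr[4:], user)

	pkt := make([]byte, stunHeaderLen+len(attr))
	binary.BigEndian.PutUint16(pkt[0:2], 0x0001) // Binding Request
	binary.BigEndian.PutUint16(pkt[2:4], uint16(len(attr)))
	binary.BigEndian.PutUint32(pkt[4:8], stunMagic)
	copy(pkt[stunHeaderLen:], attr)
	return pkt
}

func main() {
	ufrag, err := ufragFromSTUN(makeBindingRequest("abcd:wxyz"))
	fmt.Println(ufrag, err) // abcd <nil>
}
```

Because the relay reads only this one attribute, it never needs the ICE state machine or the DTLS keys — which is exactly what lets it stay stateless.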
The transceiver owns everything stateful: ICE connectivity checks, the DTLS handshake, SRTP encryption, session lifecycle. This is where the Redis cache matters. Once a route is established (client IP and port mapped to transceiver IP and port), that mapping lives in Redis. If a relay restarts and loses its in-memory session table, the next STUN packet from the client carries the ufrag hint needed to rebuild the route, and for packets that carry no hint, the relay recovers the mapping from Redis before forwarding. Session continuity survives the infrastructure churn that Kubernetes constantly generates.
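The recovery order described above can be sketched as a lookup chain. This is a hypothetical reconstruction, not OpenAI's code: a mutex-guarded map stands in for the Redis client so the sketch runs standalone, and the `Relay` and `RouteStore` names are assumptions.

```go
package main

import (
	"fmt"
	"sync"
)

// RouteStore abstracts the durable mapping backend. In production this
// would be a Redis client; here a mutex-guarded map stands in so the
// sketch runs without a server.
type RouteStore interface {
	Get(key string) (string, bool)
	Set(key, addr string)
}

type memStore struct {
	mu sync.Mutex
	m  map[string]string
}

func (s *memStore) Get(k string) (string, bool) {
	s.mu.Lock()
	defer s.mu.Unlock()
	v, ok := s.m[k]
	return v, ok
}

func (s *memStore) Set(k, v string) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.m[k] = v
}

// Relay holds only ephemeral state: a local route cache, with the shared
// store as the source of truth that survives relay restarts.
type Relay struct {
	local map[string]string // client addr -> transceiver addr
	store RouteStore
}

// Route resolves the owning transceiver for a packet: in-memory hit first,
// then the ufrag hint from a STUN packet, then the shared store for
// packets that carry no hint (e.g. SRTP media).
func (r *Relay) Route(clientAddr, ufragHint string) (string, bool) {
	if t, ok := r.local[clientAddr]; ok {
		return t, true
	}
	key := clientAddr
	if ufragHint != "" {
		key = ufragHint // STUN packets let us rebuild the route directly
	}
	if t, ok := r.store.Get(key); ok {
		r.local[clientAddr] = t // repopulate the ephemeral cache
		return t, true
	}
	return "", false
}

func main() {
	store := &memStore{m: map[string]string{}}
	store.Set("abcd", "10.0.0.7:5000")             // written at session setup
	store.Set("203.0.113.9:40000", "10.0.0.7:5000")

	r := &Relay{local: map[string]string{}, store: store} // fresh restart: empty cache
	t, _ := r.Route("203.0.113.9:40000", "abcd")          // first STUN after restart
	fmt.Println(t)
	t2, _ := r.Route("203.0.113.9:40000", "") // later media packet: local cache hit
	fmt.Println(t2)
}
```

The key design point is that the relay never writes authoritative state, only caches it, so any relay instance can serve any session after a cold start.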
The relay itself is intentionally simple. OpenAI wrote it in Go, kept it narrow, and runs it behind a small fixed UDP surface instead of the large port ranges that one-port-per-session WebRTC would demand. The blog post calls it "ephemeral state" — a short-timeout in-memory map plus Redis backup — so restarts don't cause meaningful traffic drops.
That simplicity is what makes horizontal scaling work. Multiple relay instances run behind a load balancer. The relay doesn't hold hard WebRTC state, so it can restart, reschedule, or scale without coordinating session ownership. The transceiver handles that, and the Redis cache handles the gap when the relay's memory doesn't survive.
Global Relay extends the pattern geographically. It's OpenAI's fleet of geo-distributed relay ingress points — distributed UDP forwarders placed close to users in both geography and network topology. The closer the first hop, the lower the latency and jitter before traffic hits OpenAI's backbone. Cloudflare geo and proximity steering handles the initial HTTP or WebSocket request routing, which determines which Global Relay address gets advertised to the client in the SDP answer.
DuBois, speaking in a webrtcHacks Q&A last year, described the starting condition bluntly: implementing WebRTC inside Kubernetes was challenging because everything was designed around HTTP and WebSocket protocols. His first weeks involved building a demo and trying to integrate it with existing infrastructure. The conclusion: to ship quickly, they needed to run the WebRTC implementation outside Kubernetes. The architecture described in Monday's post is the solution to that problem, now hardened for 900 million users.
Whether this pattern is novel or whether LiveKit, Daily, and other infrastructure providers have solved the same Kubernetes-WebRTC tension differently is the unresolved question. LiveKit, which OpenAI uses for ChatGPT's Advanced Voice mode, takes a different approach — a selective forwarding unit that terminates separate WebRTC connections for each participant. OpenAI's transceiver model dispenses with the SFU entirely, which it argues is the right default for 1:1 latency-sensitive traffic. The tradeoffs between the two approaches at scale aren't publicly documented.
The latency question also lacks independent verification. The blog post describes the architecture but doesn't cite latency benchmarks. A webrtcHacks analysis from January 2025 measured response latency at roughly 1.7 seconds using RTP packet analysis — before the current Global Relay and Redis-backed recovery architecture was deployed. Whether the new architecture materially improves on that number is unconfirmed.
What is confirmed is the talent. Uberti built WebRTC at Google. DuBois built Pion from scratch and maintained it as the open-source Go reference implementation. Both chose OpenAI. The problem they're solving — real-time media at AI scale — is the same problem every company building voice AI will eventually hit. The architecture they landed on, with Redis as the bridge between stateless infrastructure and stateful sessions, is the most detailed public template for how to get there.