OpenAI's New Voice API Benchmarks Mostly Check Out. The Price Does Not.
OpenAI launched three new voice models in its API on May 7. The benchmark numbers are mostly real. The price is harder to defend.
The flagship model, GPT-Realtime-2, lets developers build live voice interactions into applications — the kind of thing a customer service line or a hands-free assistant might run on. OpenAI calls it the first voice model with GPT-5-class reasoning, and the internal numbers bear that out: a 15.2 percent improvement on Big Bench Audio, a standard audio intelligence test, and a 13.8 percent gain on Audio MultiChallenge over the previous version. On a Zillow call-handling test designed to stress the model against adversarial, fast-talking real estate agents, Realtime-2 achieved a 26-point lift in call success rate, climbing from 69 percent to 95 percent. The context window expanded from 32,000 to 128,000 tokens, which matters for applications that need to hold long conversations without losing the thread.
But the independent numbers are less flattering. Artificial Analysis ran Realtime-2 through its Speech Reasoning benchmark, testing how well each model handles multi-step verbal logic. OpenAI scored 83 percent. StepFun's voice model scored 98 percent. xAI's Grok Voice scored 97 percent. Gemini 3.1 Flash scored 97 percent. OpenAI is not losing by much, but it is losing, and to models that have received a fraction of the attention.
GPT-Realtime-Translate handles live translation between languages. It supports more than 70 input languages but only 13 output languages. That asymmetry matters. A business running customer service across Southeast Asia can take calls in Vietnamese and Thai but can respond only in English, Mandarin, Spanish, French, German, Portuguese, Italian, Polish, Turkish, Japanese, Korean, Arabic, and Hindi. Indonesian, Thai, Tagalog, and dozens of other languages are ingestion-only. For companies building global products, that is a significant constraint that the headline "70+ languages" figure obscures.
Pricing reveals the real problem. GPT-Realtime-Translate costs $0.034 per minute of audio. OpenAI's own Whisper model, which handles speech-to-text transcription, costs $0.017 per minute. Translation is twice the price of transcription. In production, that math compounds: a call center handling 10,000 minutes of audio daily is paying $340 a day for translation alone, before any other cost enters the picture. Fora Soft, an independent voice infrastructure firm that analyzed Realtime pricing, calculated that the all-in cost for a typical voice call runs $0.30 per minute once input and output tokens are combined at listed rates, or $3,000 a day at that same volume.
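For buyers trying to model this themselves, the arithmetic is simple enough to sketch. The per-minute rates below are the figures cited above; the 10,000-minute daily volume is an illustrative example, not a quoted customer workload.

```python
# Back-of-the-envelope cost model using the rates cited in this article.
# TRANSLATE and WHISPER are OpenAI's listed per-minute prices; ALL_IN is
# Fora Soft's combined input/output estimate for a typical voice call.
TRANSLATE_PER_MIN = 0.034
WHISPER_PER_MIN = 0.017
ALL_IN_PER_MIN = 0.30

daily_minutes = 10_000  # illustrative call-center volume

translate_daily = daily_minutes * TRANSLATE_PER_MIN
whisper_daily = daily_minutes * WHISPER_PER_MIN
all_in_daily = daily_minutes * ALL_IN_PER_MIN

print(f"Translation only:  ${translate_daily:,.0f}/day")
print(f"Transcription:     ${whisper_daily:,.0f}/day")
print(f"All-in estimate:   ${all_in_daily:,.0f}/day")
print(f"All-in annualized: ${all_in_daily * 365:,.0f}/year")
```

At these rates the all-in estimate works out to roughly $3,000 a day, or about $1.1 million a year, before a single engineer or phone line is paid for. Actual bills depend on real token mix and call length, so treat this as a floor for negotiation, not a quote.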
Fora Soft also flagged a detail that matters for a specific and sizable slice of OpenAI's potential customer base. As of May 2026, GPT-Realtime-2 is not HIPAA-eligible. Healthcare companies, insurers, and any business handling protected health information cannot use the API for patient-facing voice applications without additional compliance architecture. OpenAI's enterprise customers in clinical settings have been waiting for a cleared path; this release does not provide it.
OpenAI says 800ms voice-to-voice latency is achievable with the right system architecture, which is fast enough for natural conversation. That claim is technically plausible and competitive. But latency benchmarks and production latency are different things, and the gap between them depends on how a customer's infrastructure is built.
The three-model launch is real progress. The reasoning improvements on Big Bench Audio are legitimate, the context window expansion opens new application categories, and the translation model covers real breadth on the input side. But the independent benchmark scores, the output-language gap in Translate, the pricing structure, and the missing HIPAA eligibility are not trivia. They are the difference between a press release and a purchase order.
For buyers: the benchmarks are worth running yourself before signing anything.