Your AI coding tool is about to feel noticeably snappier, and the model had nothing to do with it. OpenAI announced Monday that its Responses API, the interface developers use to build AI agents (autonomous tools that perform tasks in sequences), now supports a persistent WebSocket mode that holds a single network connection open across multiple API calls rather than opening and closing a fresh connection each time. The architecture caches the previous response state in memory, so the system sends only what changed instead of the entire conversation history. The result: the same model, meaningfully faster workflows.
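The caching idea can be sketched in isolation. This is not OpenAI's client code; the class names and payload fields below are hypothetical, illustrating how a connection that remembers the last response id can send only the newest turn instead of the whole transcript.

```python
# Hypothetical sketch of delta-based context sending -- not OpenAI's
# actual client. A stateless HTTP loop resends the whole transcript;
# a stateful connection that caches the previous response id sends
# only what changed.

class StatelessClient:
    """Resends the full conversation history on every call."""
    def build_payload(self, history):
        return {"input": list(history)}  # everything, every time

class StatefulClient:
    """Caches the previous response id and sends only the delta."""
    def __init__(self):
        self.previous_response_id = None

    def build_payload(self, history):
        if self.previous_response_id is None:
            payload = {"input": list(history)}   # first turn: full context
        else:
            payload = {
                "previous_response_id": self.previous_response_id,
                "input": [history[-1]],          # later turns: new turn only
            }
        # Pretend the server returned an id for this response.
        self.previous_response_id = f"resp_{len(history)}"
        return payload

history = []
stateless, stateful = StatelessClient(), StatefulClient()
for turn in ["plan the refactor", "apply it", "run the tests"]:
    history.append({"role": "user", "content": turn})
    full = stateless.build_payload(history)
    delta = stateful.build_payload(history)
    print(len(full["input"]), len(delta["input"]))
```

By the third turn, the stateless payload carries all three messages while the stateful payload carries one; the gap widens with every additional turn.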
The improvement is real, and independent benchmarker Adam Holter put a number on it. His tests found an 80% reduction in accumulated roundtrip overhead (the time spent re-sending what the system already knew) when using the new persistent connection mode. OpenAI's own blog post points in the same architectural direction, reporting a 40% end-to-end improvement for real agentic workflows built on this mode. The model did not change. The plumbing did.
The numbers from OpenAI's blog post: GPT-5.3-Codex-Spark, running on Cerebras hardware, now exceeds 1,000 tokens per second — roughly 15 times faster than earlier flagships like GPT-5 and GPT-5.2, which ran at roughly 65 tokens per second. OpenAI cut per-token overhead by 30% and reduced time-to-first-token by 50% using the cached previous_response_id mechanism to skip rebuilding context from scratch on each turn. "Even when most of the conversation hadn't changed, we still paid for work tied to the full history," the post notes.
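Those percentages compound in a simple way. The millisecond baselines below are made up for illustration; only the 50% time-to-first-token and 30% per-token overhead cuts come from OpenAI's post.

```python
# Back-of-envelope latency arithmetic. The 50% and 30% reductions are
# from OpenAI's blog post; the baseline figures are hypothetical.
baseline_ttft_ms = 400.0     # hypothetical time-to-first-token
baseline_overhead_ms = 1.0   # hypothetical per-token overhead
tokens = 500                 # length of one model response

before = baseline_ttft_ms + tokens * baseline_overhead_ms
after = baseline_ttft_ms * 0.5 + tokens * baseline_overhead_ms * 0.7

print(f"before: {before:.0f} ms, after: {after:.0f} ms")
# -> before: 900 ms, after: 550 ms
```

Under these illustrative numbers, a single 500-token turn drops from 900 ms to 550 ms of overhead-attributable latency, before any model speedup is counted.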
Early partners are reporting gains. Vercel integrated the mode into its AI SDK and saw latency drop by up to 40%. Cline's multi-file coding workflows ran 39% faster. Cursor, the AI coding editor, improved by up to 30%. Holter independently verified the infrastructure claims and confirmed the 80% roundtrip overhead reduction, calling the architecture changes the most significant addition to the Responses API since its March 2025 launch, a debut InfoQ covered at the time.
HTTP-based agentic loops send the full conversation context on every API call — a design that made sense when inference was the bottleneck. As inference speeds climbed toward 1,000 tokens per second, the cumulative cost of rebuilding that context on every turn became the dominant latency source. The model was fast. The architecture was not.
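The cumulative cost is easy to quantify. Assuming, purely for illustration, that each turn adds a fixed 200 tokens of new context, resending the full transcript grows quadratically with turn count while delta sending stays linear:

```python
# Cumulative tokens transmitted over an agentic loop, assuming (for
# illustration only) each turn adds a fixed 200 tokens of new context.
TOKENS_PER_TURN = 200

def full_resend_total(turns):
    # Turn k retransmits all k turns so far: 1 + 2 + ... + n turns total.
    return sum(k * TOKENS_PER_TURN for k in range(1, turns + 1))

def delta_total(turns):
    # A stateful connection sends only the new turn each time.
    return turns * TOKENS_PER_TURN

for n in (10, 50):
    print(n, full_resend_total(n), delta_total(n))
# -> 10 11000 2000
# -> 50 255000 10000
```

At 50 turns the full-resend loop has pushed 25 times more tokens over the wire than the delta approach, which is why the overhead dominates once inference itself is fast.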
The competitive pressure is real. If persistent connections become standard for agentic APIs, developers building on stacks without equivalent support from Anthropic, Google, or open-source alternatives face a widening performance gap. The question is whether the migration cost justifies the gain for most use cases.
There are caveats. Holter noted that Codex-Spark, the fastest model in the stack, trades some accuracy for speed: 56% on SWE-Bench Pro versus 56.8% for the flagship, with a wider gap on Terminal-Bench (58.4% versus 77.3%). More significantly, he found the model too aggressive with tool calls and token usage — so aggressive that the overhead it creates can eat through the speed gains. The bottleneck did not disappear. It moved.
The broader implication is architectural, not product-level. As inference approaches hardware limits, the next gains will not come from faster models. They will come from the systems built around them. The race is shifting from model to plumbing — and the plumbing was always the story.