OpenAI Halved Inference Cost on a Key ChatGPT Tier Using Only Software
Inference, the recurring cost of every AI response, just got cut in half for logged-out ChatGPT visitors without any new chips. It is a software win that resets the AI pricing debate.
OpenAI just told the AI industry that the most expensive part of running a frontier model is now a software problem.
According to reporting from The Information, summarized by TechTimes, OpenAI engineers demonstrated an internal optimization in June 2026 that cuts inference cost by more than half while running entirely on the company's existing server infrastructure. When the team ran the new code against ChatGPT's logged-out visitor traffic, the segment ran on a couple hundred Nvidia GPUs. The Information cites a single person familiar with the internal discussions, and OpenAI has not publicly confirmed the GPU figure. The directional claim, though, is clear: software, not silicon, did the work.
Inference is the recurring cost of generating each AI response. Unlike training, which is paid once, inference scales with every query, every user, every day. Audited Azure spend reviewed by multiple outlets shows OpenAI ran a $5.02 billion Azure inference bill in the first half of 2025 alone, a number that grows with every product and every customer. A structural reduction on even one traffic segment changes the unit economics math that API prices, profit forecasts, and competitor positioning all rest on.
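To make the unit-economics point concrete, here is a minimal back-of-the-envelope sketch. It uses the $5.02 billion first-half 2025 Azure inference figure cited above and the reported reduction of roughly half. The share of spend attributable to the optimized segment has not been reported, so the `segment_share` values below are purely hypothetical, as is the helper name `blended_saving`.

```python
def blended_saving(total_cost: float, segment_share: float, reduction: float) -> float:
    """Dollars saved when one traffic segment's cost falls by `reduction`.

    total_cost    -- total inference spend for the period
    segment_share -- fraction of that spend attributable to the segment (0..1)
    reduction     -- fractional cost cut on that segment (0..1)
    """
    if not 0.0 <= segment_share <= 1.0:
        raise ValueError("segment_share must be between 0 and 1")
    if not 0.0 <= reduction <= 1.0:
        raise ValueError("reduction must be between 0 and 1")
    return total_cost * segment_share * reduction


# Figure cited in the article: Azure inference spend, first half of 2025.
H1_2025_INFERENCE_USD = 5.02e9

# Reported cut is "more than half"; 0.5 is used as a conservative floor.
REDUCTION = 0.5

# Hypothetical shares of spend for the optimized segment (not reported anywhere).
for share in (0.05, 0.10, 0.25):
    saved = blended_saving(H1_2025_INFERENCE_USD, share, REDUCTION)
    print(f"segment share {share:.0%}: about ${saved / 1e6:,.0f}M saved per half-year")
```

The takeaway is that the saving scales linearly with the segment's share of spend, which is why the open question of whether the optimization generalizes beyond one tier matters more than the headline percentage.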
The headline number is the software halving. The under-reported number is the macro pressure that made it matter. Nvidia H100 list prices roughly doubled between 2024 and 2026 as AI demand outran the supply chain, and GPU availability stayed tight through early 2026. In that environment, any lever that decouples cost from chip procurement has strategic weight. OpenAI's June optimization is exactly that lever. It cut the GPU bill for a meaningful slice of traffic without waiting for a foundry ramp, a fab build-out, or a procurement cycle.
The optimization is not a stand-alone event. OpenAI's own announcement of its Broadcom-built Jalapeño inference chip confirms parallel silicon-side work: custom chips for the long term, software efficiency for the present. The Jalapeño program targets steady-state inference economics. The June optimization targets the same math on a shorter cycle. Treating them as competing stories misses the point. Inference cost is now a portfolio problem at OpenAI, with software and silicon as parallel lines of attack.
What changes for OpenAI's customers depends on which segment they sit in. Enterprise buyers paying API prices should ask vendors how much of their recent cost decline came from procurement and how much from code. The FourWeekMBA analysis of OpenAI's 2026 cost optimization frames this as a profitability race, where the company that holds the lowest per-query cost gains pricing power across the rest of the stack. Builders shipping products on top of OpenAI should pressure-test any roadmap assumption that depends on API prices staying where they are. The June result is a single internal demo on a single tier, not a price cut, but it lowers the floor on how cheap the underlying cost could get if the optimization generalizes. Casual ChatGPT users are unlikely to see a direct bill change, but the rate-limit conversation shifts: if OpenAI can serve logged-out traffic with a couple hundred GPUs, the cost case for raising free-tier usage caps is stronger than it was a quarter ago.
The competitor response is the watch item. Anthropic, Google DeepMind, and the companies serving open-weight models all face the same recurring-cost pressure. Seeking Alpha's coverage of the report flags OpenAI's move as evidence that software-side efficiency is now a competitive moat, not a backend concern. The next data point is whether Anthropic or Google publishes a comparable software win before the end of Q3 2026. If they do, the structural shift is industry-wide. If the optimization does not generalize past the logged-out tier, the story narrows to a clever demo with limited read-across. What the June report already suggests, if the single-source figure holds, is concrete: a couple hundred GPUs can serve logged-out ChatGPT traffic today, and that is the benchmark every other frontier lab will be measured against for the rest of 2026.