On Thursday, swyx, the organizer of the world's largest AI engineering conference series, reversed a long-held position on open AI models during a crossover episode of Latent Space and the Unsupervised Learning podcast. The same day, Cerebras filed its second IPO S-1 and Google unveiled separate chips for training and deploying AI. Three independent signals pointing in the same direction on the same day is not coincidence. Something changed in the underlying economics.
What changed: speed. Standard GPUs, the graphics processors used by every major cloud provider, run large language models (the AI systems behind chatbots and code assistants) at roughly 50 to 100 tokens per second, where a token is approximately three-quarters of a word. Custom chips built specifically for AI inference run the same models at 1,000 to 3,000 tokens per second, according to benchmarks published by Cerebras. One startup, Taalas, has posted results of 17,000 tokens per second by hardwiring a single model permanently into the chip, giving up flexibility for raw throughput. That puts the speed gap between standard cloud infrastructure and the fastest custom silicon at 10x to 340x, depending on which endpoints you compare.
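For the curious, the endpoints of that range fall out of simple division. The short Python sketch below uses only the throughput figures cited above; the ratios are arithmetic, not new measurements.

```python
# Throughput figures cited above (tokens per second).
gpu_low, gpu_high = 50, 100              # standard cloud GPUs
custom_low, custom_high = 1_000, 3_000   # custom inference silicon (Cerebras benchmarks)
taalas = 17_000                          # Taalas's hardwired single-model chip

# Most conservative comparison: fastest GPU vs. slowest custom chip.
conservative = custom_low / gpu_high     # 1,000 / 100 = 10x

# Most aggressive comparison: slowest GPU vs. the Taalas result.
aggressive = taalas / gpu_low            # 17,000 / 50 = 340x

print(f"speed gap: {conservative:.0f}x to {aggressive:.0f}x")
```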
swyx's earlier position, as he described it on the episode: faster inference for open models did not enable anything meaningfully different from what you could already get, because a doubling of speed did not change what you could build. His new position: every substantial step upward in inference speed opens product categories that did not previously exist. Real-time voice agents that sound like a conversation rather than a transcription service. Streaming interfaces that respond before the user finishes reading the previous sentence. Those use cases require response speeds that standard GPUs could not deliver at a price that made the product viable.
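A rough latency sketch shows why thresholds, not increments, matter for voice. The 150-token reply length and the half-second conversational-pause budget below are illustrative assumptions, not figures from the episode; the throughputs are midpoints of the ranges cited earlier.

```python
# Time to generate a spoken reply at different throughputs.
# The reply length and pause budget are illustrative assumptions.
reply_tokens = 150       # a few sentences of spoken response
pause_budget_s = 0.5     # rough ceiling before a pause feels unnatural

for label, tok_per_s in [("standard GPU", 75), ("custom silicon", 2_000)]:
    seconds = reply_tokens / tok_per_s
    verdict = "feels conversational" if seconds <= pause_budget_s else "feels like waiting"
    print(f"{label}: {seconds:.2f} s for {reply_tokens} tokens -> {verdict}")
```

At 75 tokens per second the reply takes two full seconds; at 2,000 it takes under a tenth of a second. Doubling GPU speed never crosses the budget. The jump to custom silicon does.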
The economics reinforce the speed argument. MIT Sloan researchers estimated that open models cost roughly 87 percent less to run than comparable closed models (the kind offered by OpenAI or Anthropic) while delivering about 90 percent of the performance. That comparison is model-to-model on benchmark performance, not a full enterprise cost accounting. But for teams running millions of AI calls per day, 87 percent lower inference cost changes the build-vs-buy calculus regardless of what the leaderboards say.
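A back-of-the-envelope sketch makes the calculus concrete. The call volume and per-call price below are hypothetical; only the 87 percent discount comes from the MIT Sloan estimate.

```python
# Hypothetical monthly inference bill at enterprise volume.
calls_per_day = 5_000_000       # hypothetical call volume
closed_cost_per_call = 0.002    # hypothetical: $0.002 per closed-model call
open_discount = 0.87            # open models ~87% cheaper (MIT Sloan estimate)

closed_monthly = calls_per_day * 30 * closed_cost_per_call
open_monthly = closed_monthly * (1 - open_discount)

print(f"closed: ${closed_monthly:,.0f}/mo, open: ${open_monthly:,.0f}/mo, "
      f"savings: ${closed_monthly - open_monthly:,.0f}/mo")
```

Under those assumptions the bill drops from $300,000 a month to $39,000. The absolute dollars are invented; the point is that an 87 percent discount at volume is the difference between a line item and a budget line.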
Cerebras, the chip company whose Thursday IPO filing coincided with swyx's episode, reported $510 million in 2025 revenue but disclosed that 86 percent came from two Abu Dhabi AI customers, according to The Next Platform's analysis of the filing. The company is seeking a $23 billion valuation and has already signed a reported $10 billion deal with OpenAI. Google's Thursday announcement was a different kind of signal: the company said it was splitting its eighth-generation TPU into separate training and inference chips because, as Google's senior vice president for AI infrastructure Amin Vahdat put it, "with the rise of AI agents, we determined the community would benefit from chips individually specialized to the needs of training and serving."
Nathan Lambert, who tracks the open versus closed model question at his newsletter Interconnects, has written that open models have consistently trailed the best closed models by six to 18 months on capability and that this gap has been stable rather than narrowing. For applications that require frontier reasoning, that lag still matters. A legal research tool or a drug discovery system needs the best available model regardless of cost per token. But a customer support chatbot or a voice interface does not. If open models on custom silicon are good enough for the large share of enterprise use cases in that second category, and 87 percent cheaper, the adoption math shifts.
The numbers behind the shift carry caveats. The Cerebras and Taalas benchmarks are hardware performance claims, not production figures under real load from real users. If GPU-based cloud providers narrow the gap with their own hardware upgrades, the arithmetic changes. The 87 percent cost comparison is model-to-model, not infrastructure-to-infrastructure. And faster inference does not close the six-to-18-month capability gap. A model that runs hundreds of times faster but still trails on complex reasoning tasks is a faster version of the same limitation.
What to watch: whether the major cloud providers begin offering custom inference silicon as a standard service tier, which would commoditize the speed advantage Cerebras and Taalas currently hold, and whether the benchmark gaps hold when Cerebras's customer base expands beyond two Abu Dhabi entities. The hardware case for open models is real. The production case is still being tested.