Baseten says custom AI models are turning open weights into a chip problem
The cheap-AI story comes with a bill that rarely gets itemized: once companies customize a model for their own product, they also inherit the problem of keeping that model fast, reliable, and fed with scarce chips. Baseten says that is already where most of its demand sits, which makes open models look less like free software and more like the start of a new infrastructure dependency.
Tuhin Srivastava, founder and CEO of Baseten, an AI inference cloud, put a number on that shift in an interview on the No Priors podcast: he said 95 percent of Baseten's tokens run on dedicated inference, where customers run modified models with their own data rather than plain open-source models. Inference is the act of running an AI model after it has been trained, and a token is a small chunk of text that the model reads or generates. The implication is sharper than yet another GPU-shortage anecdote: the more AI products specialize, the more the supposedly abundant model layer depends on scarce physical capacity.
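To make the unit concrete, here is a minimal token-accounting sketch. It uses a naive whitespace splitter as a stand-in for a real subword tokenizer, so the counts are illustrative rather than what a production model would actually bill:

```python
def count_tokens(text: str) -> int:
    """Crude stand-in for a tokenizer: real models split text into
    subword tokens, so true counts usually run higher than word counts."""
    return len(text.split())

prompt = "Summarize this support ticket for the on-call engineer."
completion = "Customer reports intermittent 504 errors since the last deploy."

# Inference meters both directions: tokens the model reads (the prompt)
# and tokens it generates (the completion).
total = count_tokens(prompt) + count_tokens(completion)
print(f"prompt={count_tokens(prompt)} completion={count_tokens(completion)} total={total}")
```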
This is a self-interested claim. Baseten sells inference infrastructure, so scarcity is good for its story. The No Priors host framed Baseten as having grown 30x over the past year, with more than $1 billion in revenue expected this year; Srivastava did not dispute that framing before explaining the demand surge. Baseten's actual growth path involves compounding from a smaller base: five months earlier, Fortune reported that Srivastava described revenue as up "more than 10x" over the prior 12 months. Both are company-reported snapshots from different periods, not audited public filings, and the gap is a reason to treat the scale claims as directional rather than exact.
The inversion starts with what customers are actually running. In practical terms, Baseten is saying most demand is not coming from companies simply grabbing a free model and calling an API. It is coming from businesses that have made the model specific enough that they need dedicated capacity, service-level guarantees, and performance tuning. In the No Priors interview, Srivastava described customers changing models for quality and performance, saying "no one is just running the vanilla open source weights."
That cuts against the easy version of the open-model story. If model weights become widely available but every serious product has to customize, compile, host, and guarantee them, the value does not vanish into the model file. It moves into the machinery around it.
The machinery is not abstract. Srivastava said Baseten now sits in 18 different clouds and can bring a new provider online with the Baseten inference stack in half a day or less, according to No Priors. The company built that fabric for reliability, latency, and failover, but the same system has become a way to find compute "wherever humanly possible." That is a strange position for a software company to be in: the product gets better only if the company can keep arbitraging physical capacity across data centers.
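What a capacity-arbitrage fabric actually does is easier to see in code than in prose. Below is a minimal sketch of a failover routing policy; the provider names, health flags, latency figures, and GPU counts are invented for illustration and are not Baseten's system:

```python
from dataclasses import dataclass

@dataclass
class Provider:
    name: str               # hypothetical cloud provider
    healthy: bool           # passing health checks right now
    p99_latency_ms: float   # observed tail latency
    free_gpus: int          # capacity left to absorb new load

def pick_provider(providers: list[Provider], gpus_needed: int) -> Provider | None:
    """Route a workload to the healthy provider with spare capacity and the
    best tail latency; return None when the whole fleet is saturated."""
    candidates = [p for p in providers if p.healthy and p.free_gpus >= gpus_needed]
    if not candidates:
        return None  # every cloud is down or full: the scarcity case
    return min(candidates, key=lambda p: p.p99_latency_ms)

fleet = [
    Provider("cloud-a", healthy=True,  p99_latency_ms=120.0, free_gpus=2),
    Provider("cloud-b", healthy=False, p99_latency_ms=80.0,  free_gpus=16),
    Provider("cloud-c", healthy=True,  p99_latency_ms=95.0,  free_gpus=8),
]
print(pick_provider(fleet, gpus_needed=4))  # -> cloud-c
```

The design point the sketch makes plain: adding a nineteenth provider is a data change, not a re-architecture. The hard engineering is keeping the health and capacity signals trustworthy.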
The supplier side is even narrower than the headline GPU shortage suggests. Srivastava said there are "probably like a dozen good clouds" and only "three or four" he would put in the gold tier for AI inference, according to the podcast. That distinction matters for buyers. A cluster that exists on paper is not the same as a provider that can hit latency, reliability, and support requirements for production inference. The dashboard says there is capacity. The operator asks whether it will stay up.
The strongest part of the scarcity argument is also the least independent. Srivastava said Baseten can add a new provider in another country to its inference fabric in about half a day, but "even for us, it is hard for us to grow," according to No Priors. He described a standing 4 p.m. company meeting devoted to managing capacity for current demand. That is not market-wide proof. It is a CEO's operating anecdote from a company that benefits when buyers believe supply is tight.
Funding context shows why Baseten has an incentive to tell this story loudly. The company raised $75 million in early 2025 to expand inference workloads, SiliconANGLE reported. Fortune reported a $150 million round at a $2.15 billion valuation in September 2025. In January 2026, HPCwire reported that Baseten raised $300 million at a $5 billion valuation to power what it called a multi-model future. The company has every reason to frame inference as the last great AI market.
The useful question is not whether Baseten benefits from that framing. It does. The useful question is whether the market facts point the same way. If most production demand shifts toward custom models, then AI applications do not escape infrastructure economics by using open weights. They become more dependent on specialized hosting, tuning, and capacity planning. If only a few providers can run inference reliably at scale, then the cloud layer keeps pricing power even as model files circulate more freely.
That is where Srivastava's broadest claim lands. He called inference "the last market," because even if artificial general intelligence arrives, the remaining business is running intelligence for users again and again, according to No Priors. Strip away the maximalist language and the operating point is still sharp: training creates a model once, but inference charges the meter every time the model is used.
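The meter metaphor reduces to arithmetic. A sketch with invented numbers (every figure below is an assumption for illustration, not a reported Baseten or industry price):

```python
# Illustrative only: every figure is an assumption, not a reported price.
training_cost_usd = 5_000_000       # one-time cost to train or fine-tune a model
price_per_million_tokens = 2.00     # hypothetical price per million tokens served
tokens_per_day = 20_000_000_000     # hypothetical production traffic: 20B tokens/day

# Training is paid once; inference accrues every day the model serves traffic.
daily_inference_usd = tokens_per_day / 1_000_000 * price_per_million_tokens
days_to_match_training = training_cost_usd / daily_inference_usd

print(f"inference meter per day: ${daily_inference_usd:,.0f}")
print(f"days of serving to equal the one-time training bill: {days_to_match_training:,.0f}")
```

The structure, not the numbers, is the point: training is a one-time term, while the inference line scales with traffic and never stops accruing.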
The counterforce is supply. If Nvidia, TSMC packaging capacity, high-bandwidth memory, and new cloud buildouts catch up faster than expected, today's scarcity can turn into tomorrow's margin compression. Baseten's 95 percent dedicated-inference figure is also Baseten's number, not an independently audited market statistic. It may reflect the customer mix of a fast-growing inference specialist more than the whole AI economy.
Still, the direction of pressure is clear enough for builders. The cheap-intelligence story assumes inference becomes an abundant utility. The operating evidence from Baseten points to a more annoying future: more intelligence does arrive, but through scarce reliable clouds, custom model loops, capacity meetings, and companies whose software advantage depends on whether they can keep the chips busy without running out of them.