Harvey cut AI inference costs threefold. The bigger number for the frontier labs is 80.

Harvey cut AI inference costs threefold. The bigger number for the frontier labs is 80. — type0 | type0

PREVIEWHarvey cut AI inference costs threefold. The bigger number for the frontier labs is 80. · MD

In June, Harvey, a legal-AI company, cut its AI inference bill to roughly a third of what it had been. The company routed most tasks to a small open-weight model and reserved the frontier for the hard cases. The more consequential number is 80, the share of AI workloads that Coinbase co-founder Brian Armstrong predicted on X will run on 99% cheaper models within 12 to 18 months. If Armstrong is even directionally right, the savings come straight out of OpenAI and Anthropic's revenue, just as both labs are heading toward IPOs.

Harvey, which builds AI tools for law firms and in-house legal teams, partnered with Fireworks AI, an inference platform, on a routing architecture that uses an open-weight worker for routine work and reserves Anthropic's Claude Opus for the hardest sub-tasks. On a 100-task slice of Harvey's Legal Agent Benchmark, the hybrid stack hit 18 out of 100 all-pass and cost $368. Running Claude Opus 4.7 end-to-end on the same slice: 14 out of 100 all-pass, for $954. The hybrid won on both axes. The threefold cost reduction is the data point. The architecture is the part that should be making the frontier labs nervous.

What Harvey actually proved is that the real divide in the market is not proprietary versus open-weight. It's large versus small. Russell Brandom reported in TechCrunch that switching GPT-5.5 to DeepSeek's V4 Flash works, and that switching to OpenAI's GPT-5.4-mini works just as well. The price war between in-house inference and independently served open-weight models is active. The load-bearing question is whether a small model can do the job at all. Harvey's result says yes, for the bulk of a 100-task legal benchmark.

That result reframes what "quality" means. Gabe Pereyra, Harvey's co-founder, told TechCrunch: "Quality comes first, and in legal it always will. However, the definition of quality is evolving from simply using the most powerful model for everything, to using the best model that gets the right answer most efficiently." That is a quiet redefinition of the term the entire industry has been optimizing for. "Best" used to mean "biggest." The Harvey test suggests it now means "right-sized," with a routing layer that picks the right model per sub-task.

The pressure on the labs is not new, but the math is getting sharper. A TechCrunch investigation four days earlier documented the cost scramble already underway: Uber burned through its entire 2026 AI coding budget by April. Microsoft revoked Claude Code licenses. Priceline's Cursor renewal came in four to five times more expensive than the year before. The Linux Foundation, the open-source standards body, launched a Tokenomics Foundation working group to make AI unit economics legible. J.R. Storment, executive director of the FinOps Foundation, called the current moment an "existential crisis" for teams running three times over their token budgets. Chris Reed, a senior director of IT finance at Priceline, described the dynamic to TechCrunch as a "crack-cocaine epidemic" of per-seat AI tooling. Even Alexander Embiricos, OpenAI's head of enterprise, acknowledged to TechCrunch that the conversation has shifted from capability to efficiency.

That shift has not yet shown up in lab revenue lines, but the IPO clock is ticking. OpenAI and Anthropic are both reportedly preparing public offerings. If buyers can route 80% of their workloads to models that cost 1% as much, the addressable market for premium inference shrinks just as the companies are trying to grow into their valuations. Labs can compete on the 20% of workloads where IQ-maxing matters. They can compete by buying their way into the routing layer. The third option, hoping customers keep paying frontier prices out of habit, is the one Armstrong's prediction forecloses.

The skeptical read deserves its own paragraph. Companies under cost pressure often economize in ways that look like model substitution but aren't: they make fewer API calls, shrink their context windows, kill their least promising deployments. The Harvey result is a specific architectural choice on a specific task mix. Generalizing from 100 legal tasks to all of enterprise is a real extrapolation, and Brandom flags it as the central unknown. The smaller-model shift is a hypothesis with one strong data point, not a measured industry outcome. Niko Grupen, Harvey's head of applied research, wrote in the Fireworks blog post that the harness engineering, not just the model swap, is what makes the architecture work. Reproducing that harness across a customer-support stack, a coding workload, or a financial-analysis pipeline is not a free option.

What to watch: the next two quarters of customer disclosures from OpenAI and Anthropic's enterprise accounts, particularly the share of revenue coming from the top-decile workloads where frontier pricing still holds. Watch the open-weight release cadence from DeepSeek, Moonshot's Kimi line, and the smaller Anthropic and OpenAI tiers. The closer those small models get to frontier quality on real workloads, the faster the routing spreads. Watch the routing-layer startups that have started to productize the Harvey pattern. The first ones to ship a turnkey frontier-advisor-plus-open-worker harness will set the default for everyone who follows.

The industry is moving past "biggest model wins" into a fit-for-purpose era. The frontier labs are not finished. The open question is who owns the routing layer, the part of the stack that decides which model answers which question. That answer is what Armstrong's 80 will actually mean.

Harvey cut AI inference costs threefold. The bigger number for the frontier labs is 80.

Sources