Veer Kheterpal, CEO of the edge AI chip company Quadric, runs most of his CEO workflows through AI agents. Not a chatbot that answers questions, but an agent that reads his email, Slack, and texts, synthesizes everything into priority recommendations, and checks in three times a day. The podcast host framed it as 80 percent of his workflows in a setup question; Kheterpal answered with the substance of how the workflow operates without repeating or endorsing that specific number. What he did say on Semiconductor Insiders: he can handle six proposals and four detailed investor conversations in a day of work that previously took weeks. The bottleneck is not the AI. It is the silicon underneath, and that bottleneck is about to reshape how AI chips are designed.
The reason is a structural difference between two kinds of inference that the industry is only beginning to confront. Cloud AI — the kind you get when you open ChatGPT, type a prompt, and get a response — is bursty and multi-tenant. Thousands of users share a GPU. Workloads arrive in spikes. The system evicts one job and takes the next. Edge AI designed for agents is the opposite: one agent running continuously, 24 hours a day, on dedicated silicon that never stops. Kheterpal used a kitchen analogy to make the point: cloud AI is like a restaurant kitchen cooking many things at once, constantly swapping out workloads, while a personal chef cooks for one person all the time, continuously.
That distinction sounds like a product description. It is actually an argument about computer architecture — and it has real consequences for what chips need to look like.
The compute profile is not a matter of degree. It is a different engineering problem. Cloud inference is optimized for throughput across many concurrent users. Agentic inference is optimized for sustained, low-latency operation by a single workload. Multi-tenancy drives decisions about memory allocation, power delivery, and thermal headroom that single-tenant continuous operation simply does not share. Kheterpal framed it as a foundational question: whether you are designing your silicon, software stack, and memory access patterns around a single batch, or around many.
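The difference is easiest to see as two scheduling loops. A toy sketch in Python (the class names and batching logic are illustrative, not Quadric's or any vendor's actual serving stack):

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Request:
    user: str
    prompt: str


class MultiTenantServer:
    """Cloud-style inference: many users share one accelerator.
    Requests queue up and are served in batches to maximize throughput;
    any one user's latency depends on everyone else's traffic."""

    def __init__(self, batch_size: int = 4):
        self.batch_size = batch_size
        self.queue: deque = deque()

    def submit(self, req: Request) -> None:
        self.queue.append(req)

    def step(self) -> list:
        # Pull up to batch_size queued requests and serve them together.
        n = min(self.batch_size, len(self.queue))
        batch = [self.queue.popleft() for _ in range(n)]
        return [f"{r.user}:{r.prompt}" for r in batch]


class DedicatedAgent:
    """Edge-style inference: one agent owns the silicon and runs
    continuously. Every step serves the same tenant; there is no queue,
    no eviction, and no contention for memory or thermal headroom."""

    def __init__(self, user: str):
        self.user = user

    def step(self, observation: str) -> str:
        return f"{self.user}:{observation}"
```

The restaurant-kitchen versus personal-chef analogy falls directly out of the two `step` methods: one amortizes cost over a batch of strangers, the other spends every cycle on a single principal.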
The practical consequence shows up in how Quadric's licensees are deploying the architecture. Customers are designing chips for laptops capable of running dense 7 billion to 30 billion parameter models locally, reaching for cloud models only when planning or architectural decisions require frontier capability. Industrial manufacturing is another early pull: a machine on a factory floor running an agent that diagnoses problems, analyzes sensor data, writes code, and emails support, with the whole loop completing without a round trip to a data center. Kheterpal framed it as a shift away from streaming data to the cloud for later processing and toward running a small brain locally, one that handles diagnosis, production-data analysis, and communication with factory support and management entirely on-device.
Kheterpal's company, Quadric, is a semiconductor IP licensing firm: it sells the blueprints for AI processors, following the ARM model rather than building chips itself. The company was founded in 2017 by veterans of PDF Solutions who had previously co-founded a Bitcoin mining company acquired by Coinbase in 2020. Their bet was that the CUDA ecosystem that had standardized datacenter AI had no equivalent at the edge, and that a programmable architecture would age better than fixed-function accelerators as AI models evolved. That bet predated the current agent wave by years, Kheterpal said: the wager on programmability was placed long ago, and the big use cases arrived only recently.
The emphasis on programmability is not incidental. Agentic workloads are dynamic in a way that cloud inference is not. An agent running continuous inference decides in real time which models to invoke — for analyzing audio, for camera streams, for writing software to do analysis — all simultaneously on the same hardware, Kheterpal said.
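The shape of that workload can be made concrete with a toy dispatcher (the model functions here are stand-in stubs, not real models or any real on-device API):

```python
# Stand-in stubs for on-device models; a real deployment would load
# quantized networks here. All names are illustrative.
def audio_model(x: str) -> str:
    return f"audio({x})"


def vision_model(x: str) -> str:
    return f"vision({x})"


def code_model(x: str) -> str:
    return f"code({x})"


# The agent picks from this table at runtime, not at chip-design time.
MODELS: dict = {
    "audio": audio_model,
    "camera": vision_model,
    "write_code": code_model,
}


def agent_step(tasks: list) -> list:
    """One tick of a continuously running agent: it decides at runtime
    which models to invoke on the same hardware. A fixed-function
    accelerator hardened for one of these graphs could not serve the
    others; a programmable core serves whatever lands in the table."""
    return [MODELS[kind](payload) for kind, payload in tasks]
```

The dispatch table is the point: the set of models, and even the model architectures behind each entry, can change after the silicon ships.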
This is where Kheterpal offers a striking set of numbers. For the last 24 months, he gave two characterizations of hardware improvement in the same answer, roughly 10x in one framing and about 15x in another, citing TSMC process advances, better packaging, and HBM density. Meanwhile, software improvements, including flash attention, FP8 quantization, speculative decoding, and new model architectures, account for 23x to 25x. His summary: software improvements have consistently outpaced hardware gains year over year. The implication cuts against the narrative that AI progress is fundamentally a hardware scaling story: the bigger lever has been algorithmic efficiency, and the hardware that captures the most value from that efficiency is hardware flexible enough to run whatever the next model architecture happens to be.
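Taking Kheterpal's figures at face value, the arithmetic is easy to check. A short calculation (the inputs are his estimates, not benchmarks; the annualization is the standard geometric conversion):

```python
MONTHS = 24                     # Kheterpal's window: "the last 24 months"
hw_gain = (10.0, 15.0)          # his two framings of hardware improvement
sw_gain = (23.0, 25.0)          # his range for software improvement


def annualized(total: float, months: int = MONTHS) -> float:
    """Convert a total multiplier over `months` into the equivalent
    per-year rate: total ** (12 / months)."""
    return total ** (12 / months)


# Over 24 months, the per-year rate is just the square root of the total.
hw_yearly = tuple(annualized(g) for g in hw_gain)  # ~3.16x to ~3.87x per year
sw_yearly = tuple(annualized(g) for g in sw_gain)  # ~4.80x to 5.00x per year

# The two levers multiply: hardware flexible enough to run each new
# algorithm captures both factors at once.
combined = (hw_gain[0] * sw_gain[0], hw_gain[1] * sw_gain[1])  # 230x to 375x
```

Annualized, his hardware range works out to roughly 3.2x to 3.9x per year against 4.8x to 5x for software, which is the precise sense in which software has "consistently outpaced" hardware in his telling.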
That is why Kheterpal argues Nvidia's road map has been defined less by raw compute scaling and more by format flexibility. He described Nvidia's eight-year lead as rooted primarily in offering all the different number formats during training — starting with FP32 and expanding across the ecosystem — because that flexibility is what gives developers the freedom to experiment and deliver new model architectures. Tie hardware to specific attention mechanisms or number formats and you freeze the software ecosystem's ability to experiment. Leave it programmable and you ride the wave of software-driven gains.
There is a risk in this framing that deserves skepticism. The 23x-to-25x software figure and the 10x-to-15x hardware figures are Kheterpal's own estimates, not independently verified benchmarks. The relative contribution of software versus hardware to AI performance improvements is genuinely contested and methodology-dependent. What counts as improvement (throughput, latency, accuracy, energy efficiency) changes the numerator substantially. The claim is directionally consistent with other estimates in the research literature, but it should be read as an informed industry view rather than a measured result.
The other open question is whether edge deployment of capable agentic AI is as close as the transcript suggests. Quadric's licensees are designing chips for laptops and industrial machines — but chips designed are not chips shipped. The gap between a licensee announcement and a product in a consumer's hands is measured in years, not quarters. Industrial deployment on factory floors involves qualification cycles, reliability testing, and integration work that consumer software updates do not. The timeline Kheterpal describes is real, but it is compressed from a customer road map, not an independent assessment of semiconductor supply chains.
The core argument, though, holds up: cloud inference and agentic inference are not the same workload wearing different labels. They demand different things from silicon. The bursty, multi-tenant model that powers today's AI APIs is architecturally optimized for a world where the unit of work is a prompt-response cycle. The continuous, single-tenant model that agentic AI requires is optimized for a world where the unit of work is a goal pursued over hours across multiple tool calls and model invocations. Those are not the same chip. And the companies that build for the wrong workload while the market moves to the other will find themselves with very fast restaurant kitchens and no customers who want to eat there.