Veer Kheterpal runs 80 percent of his CEO workflow through AI agents: three times a day, an agent synthesizes his email, Slack, texts, and strategic contacts, surfaces priorities, launches deep work, and generates proposal decks that used to take weeks. Six proposals and four investor meetings in a single day. Kheterpal is Quadric's founder and CEO. The point is not the company. The point is what his workload reveals about how AI inference actually works, and what that means for the silicon being built to run it.
Here is the distinction that matters: cloud inference and edge inference are not the same system at different sizes. They are different architectures, running different workloads, with different silicon implications.
Cloud AI is a restaurant kitchen. Many users send requests to a shared GPU, the system cooks fast, delivers the answer, clears the table, and moves to the next order. Burst traffic. High utilization across many concurrent users. Efficiency comes from switching fast and keeping the GPU busy.
Edge AI, running on a laptop or in a factory-floor robot, is a personal chef. One agent, running 24 hours a day, seven days a week. Cooking continuously for one person. Never stopping. Never switching to someone else's dish. The chef knows your preferences, monitors your pantry, and emails support when the stove starts making noise. That is a fundamentally different kind of cooking, and it requires a fundamentally different kitchen.
The silicon implications follow directly from the workload. Multi-tenant cloud inference is optimized for throughput — the GPU switches between tasks, evicts one job, takes the next. Edge inference for agents is single-tenant, continuous, and sustained. The compute profile is inverted: not bursty utilization across many users, but a single agent consuming tokens around the clock.
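That inversion can be made concrete with a roofline-style back-of-envelope. The sketch below is illustrative, not from the podcast; every number (model size, bandwidth, FLOP rate) is an assumption chosen to be plausible for a data-center GPU. It shows the cloud's lever: each decode step reads the full weights once regardless of batch size, so batching many tenants multiplies throughput until compute binds, while a batch-of-one edge agent gets no such amortization.

```python
# Roofline-style sketch of why cloud serving optimizes for batch
# throughput while an edge agent cannot. All numbers are illustrative
# assumptions (a hypothetical 7B-parameter FP16 model on a hypothetical
# data-center GPU), not figures from the podcast.

def decode_tokens_per_sec(batch, weight_bytes, mem_bw, compute_flops, flops_per_token):
    """Estimate decode throughput for one step at a given batch size."""
    t_mem = weight_bytes / mem_bw                        # weights read once per step
    t_compute = batch * flops_per_token / compute_flops  # work grows with batch
    return batch / max(t_mem, t_compute)                 # slower limit wins

WEIGHTS = 7e9 * 2     # ~14 GB of FP16 weights (assumed)
BW      = 2e12        # 2 TB/s HBM bandwidth (assumed)
FLOPS   = 300e12      # 300 TFLOP/s sustained (assumed)
FL_TOK  = 2 * 7e9     # ~2 FLOPs per parameter per generated token

for b in (1, 8, 64):
    print(f"batch {b}: {decode_tokens_per_sec(b, WEIGHTS, BW, FLOPS, FL_TOK):.0f} tok/s")
# Batch 1 is memory-bound at ~143 tok/s here; batch 64 reads the same
# weights per step but emits 64 tokens, so throughput scales ~64x.
# An edge agent serving one user is pinned at the batch-1 line.
```

Under these assumed numbers, every batch size shown is still memory-bound, which is why throughput scales linearly with batch: that is exactly the amortization a single-tenant edge workload never gets.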
Quadric makes programmable AI processor IP — the blueprints chip designers license to build edge inference silicon, similar to how ARM licenses processor IP for mobile phones. Note: Quadric's business model is ARM-style IP licensing, so the comparison benefits the company's own positioning. The company has raised Series A, B, and C, has three generations of silicon, and sells IP to customers designing chips for laptops (running 7 billion to 30 billion parameter models locally), industrial manufacturing equipment, and robotics. Defense is a separate vertical.
The programmability bet is deliberate. Quadric built its C++ toolchain alongside the hardware starting in 2018, then added a Python layer. The reasoning, stated plainly: if hardware is locked to a specific attention mechanism or number format, it becomes obsolete when the next model architecture arrives. The memory wall problem makes this acute — LLMs are memory-bound, and the gap between what models need and what silicon delivers has been the constraint the industry keeps running into.
On performance improvements over roughly the last 24 months, Kheterpal estimates that software delivered gains of 23 to 25 times: flash attention, FP8 quantization, speculative decoding, and new model architectures cutting memory usage and increasing throughput. Hardware improvements (process shrinks, packaging, HBM) delivered roughly 10 to 15 times over the same period. Those are his own figures, not third-party benchmark results, and they carry the usual self-serving caveat that applies to any founder describing their market.
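One of the software levers named above can be shown in a few lines. This is a generic per-tensor symmetric INT8 scheme, chosen for simplicity rather than FP8, and it is not Quadric-specific; it only makes the memory-cutting mechanism concrete: the same weights shrink to a quarter of their FP32 footprint, which directly relieves a memory-bound decode.

```python
# Minimal per-tensor symmetric INT8 quantization: a generic sketch of
# the memory-cutting mechanism behind techniques like FP8 quantization.
# (INT8 is used here for simplicity; nothing below is Quadric-specific.)
import numpy as np

def quantize_int8(w: np.ndarray):
    """Map float weights onto int8 with one shared scale."""
    scale = np.abs(w).max() / 127.0           # largest weight maps to 127
    q = np.round(w / scale).astype(np.int8)   # 1 byte per weight
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.standard_normal(4096).astype(np.float32)  # toy weight tensor
q, scale = quantize_int8(w)

print(w.nbytes // q.nbytes)                           # footprint shrinks 4x vs FP32
print(float(np.abs(dequantize(q, scale) - w).max()))  # worst error under one scale step
```

Production schemes are more elaborate (per-channel scales, outlier handling, FP8 formats with hardware support), but the payoff is the same: fewer bytes moved per token.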
What the memory wall implies for edge silicon is this: fixed-function accelerators built for today's model architecture risk becoming stranded assets when the next quantization technique or attention mechanism arrives. Programmable processors can be recompiled. Kheterpal frames this as the ARM model versus the ASIC model for edge AI — license flexible, evolvable IP, or build something optimized for today's workload that may be obsolete in 18 months. The analogy is useful; the commercial interest behind it is noted.
Industrial manufacturing is the near-term application he is most specific about. A machine in a factory running an agent (observing sensor data, writing diagnostic emails to support staff, analyzing production metrics) inverts the current cloud-centric model. Instead of streaming all data to a remote data center for processing, the inference runs locally on a small-form-factor chip, with cloud models used only for planning and architectural decisions. He calls it a small brain in the machine, emailing support when something looks wrong.
Robotics follows, but volumes are not there yet. Defense and security applications are already here.
The laptop use case is closest to shipping. Customers are designing chips with Quadric IP that run 7B-to-30B-parameter models locally, sufficient for most agentic tasks without touching a cloud subscription. Cloud models handle heavy reasoning that requires frontier-scale compute. The local chip handles the continuous, low-latency, always-on inference that frontier serving cannot do efficiently, because it is built for burst traffic.
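The cited 7B-to-30B range maps onto laptop memory with simple arithmetic. The numbers below are my own back-of-envelope, not Quadric's specs: at 4-bit quantization, a common setting for local inference, the weights of a 7B model take about 3.5 GB and a 30B model about 15 GB, before KV cache and activations.

```python
# Back-of-envelope weight footprints for the cited 7B-30B laptop range.
# Bit-widths are assumptions (FP16 baseline, 4-bit local), not figures
# from the podcast; KV cache and activations are extra on top of these.
def weight_gb(params_billion: float, bits_per_weight: int) -> float:
    """Gigabytes needed to hold the weights alone."""
    return params_billion * 1e9 * bits_per_weight / 8 / 1e9

for params in (7, 30):
    for bits in (16, 4):
        print(f"{params}B @ {bits}-bit: {weight_gb(params, bits):.1f} GB")
# 7B drops from 14.0 GB (FP16) to 3.5 GB at 4-bit; 30B from 60.0 to 15.0 GB.
```

At 4-bit, even the top of the range fits in a 32 GB laptop with headroom, which is consistent with why the article treats 30B as the local ceiling.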
None of this is an argument that edge beats cloud. It is an argument that inference is not one-size-fits-all. The restaurant kitchen and the personal chef both exist. The silicon companies building for each are making different bets: one on utilization across many concurrent users, one on sustained single-tenant intelligence. Kheterpal is betting on the latter.
The primary source is the Semiconductor Insiders podcast episode EP336 with Veer Kheterpal, CEO of Quadric.