Groq's LPU is now a built-in part of NVIDIA's Vera Rubin. The harder question is whether that makes it a real business or a feature on someone else's roadmap.

NVIDIA's decision to fold Groq's language processing unit, or LPU, into its next flagship data-center platform is the first concrete signal that AI inference is reorganizing into a stack of permanent specialized roles. The more interesting question is not whether Groq wins but what it means when an independent chip designer becomes a load-bearing dependency inside the industry's dominant platform.

The LPU is a chip designed for the part of AI that runs after a model has read a prompt. A user asks a question, the model has to produce the answer one token at a time, and that generation phase, often called decode, is where latency and per-token cost matter most. GPUs were built as general-purpose accelerators that also handle the much heavier training step. Groq, founded in 2016, bet early that a chip optimized only for the generation stage could beat general-purpose GPUs at it. A Chinese trade-press analysis in Leiphone (雷峰网) frames the resurgence of that decade-old thesis as the first real test of whether inference-specialized chips can be a standalone business.

The immediate catalyst is NVIDIA's Groq 3 LPX system, described in its developer blog, which pairs Groq's third-generation LPU with NVIDIA's upcoming Vera Rubin GPUs. NVIDIA has publicly described a roughly 25 percent Groq LPU and 75 percent Vera Rubin split for "high-value token generation" workloads, which is a load-bearing assignment: the LPU is being given the part of the pipeline where SRAM bandwidth and deterministic scheduling matter most. Per-chip bandwidth is quoted at around 150 terabytes per second, and a fully populated 256-LPU LPX rack is described as delivering up to 40 petabytes per second of aggregate bandwidth, with NVIDIA's LPX product page framing that as the latency edge over a single H100 with HBM3 memory.
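The rack-level figure can be sanity-checked against the per-chip number. The following is a minimal sketch, assuming decimal units (1 PB = 1,000 TB) and taking the quoted 150 TB/s and 256 LPUs at face value; the function and variable names are illustrative and do not come from NVIDIA's material.

```python
# Back-of-the-envelope check of the quoted LPX bandwidth figures.
# Assumption: decimal units (1 PB/s = 1,000 TB/s). The inputs are the
# approximate numbers quoted in the text, not measured values.
PER_LPU_TB_PER_S = 150   # "around 150 terabytes per second" per chip
LPUS_PER_RACK = 256      # fully populated LPX rack


def aggregate_pb_per_s(per_chip_tb_s: float, chips: int) -> float:
    """Sum per-chip bandwidth across a rack and convert TB/s to PB/s."""
    return per_chip_tb_s * chips / 1000


rack_pb_per_s = aggregate_pb_per_s(PER_LPU_TB_PER_S, LPUS_PER_RACK)
print(f"{rack_pb_per_s:.1f} PB/s")  # 38.4 PB/s
```

The product comes to 38.4 PB/s, which is consistent with "up to 40 petabytes per second" as a rounded headline figure.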

The slot matters because it is the first time an independent inference accelerator has been integrated at the platform level inside NVIDIA's own stack. Until now, Groq sold its hardware largely as a standalone cloud service, with partnerships such as the Meta–Groq deal to run the official Llama API on Groq hardware. The LPX changes the relationship: NVIDIA owns the rack, the software surface, and the customer relationship, and Groq provides one specialized component inside it. That is what an AI compute stack split by workload rather than by vendor looks like.

The economics behind that split are not in dispute. Epoch AI's analysis of AI chip component costs puts memory at roughly 63 percent of a modern accelerator's bill of materials. That is why the LPU's pitch is structurally durable: move as much of the model as possible into on-chip SRAM, where bandwidth is an order of magnitude higher than HBM and access latency is deterministic. If memory is the cost, then owning the memory architecture is the moat, and the LPU is, at the silicon level, a memory-architecture bet disguised as a startup.

What is genuinely unresolved is whether the LPU is a permanent layer of the stack or a feature that NVIDIA will absorb into its own silicon. The Leiphone piece gathers skepticism from chip architects who argue that the original LPU edge came from a compiler that statically scheduled a model's dataflow, and that advantage narrows as Transformer core operations converge across vendors. Mixture-of-experts models, where the model routes each token to different parameter subsets, work against fully static scheduling. Hardware-side redundancy, the kind that lets a GPU keep running when a computation fault is detected, can also erode the theoretical SRAM advantage. The remaining edge, by the architects' own read, is the raw SRAM bandwidth number.
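The tension between static scheduling and expert routing can be shown with a toy model. This is an illustrative sketch only: `dense_plan`, `moe_plan`, and the modulo-based `route` function are hypothetical stand-ins, not Groq's compiler or any real router. It shows that a dense model's weight-access plan is the same for every token, so it can be fixed ahead of time, while a mixture-of-experts plan depends on the token itself.

```python
# Toy illustration (hypothetical names throughout): why data-dependent
# expert routing resists a fully static, compile-time schedule.
from typing import List

NUM_LAYERS = 3
NUM_EXPERTS = 4


def dense_plan(token_id: int) -> List[str]:
    """Dense model: every token touches the same weights in the same order."""
    return [f"layer{layer}.ffn" for layer in range(NUM_LAYERS)]


def route(token_id: int, layer: int) -> int:
    """Stand-in for a learned router; a real one is a small neural network."""
    return (token_id * 7 + layer) % NUM_EXPERTS


def moe_plan(token_id: int) -> List[str]:
    """MoE model: the expert chosen at each layer depends on the token."""
    return [
        f"layer{layer}.expert{route(token_id, layer)}"
        for layer in range(NUM_LAYERS)
    ]


tokens = [11, 42, 97]
dense_plans = {tuple(dense_plan(t)) for t in tokens}
moe_plans = {tuple(moe_plan(t)) for t in tokens}

# One distinct plan: the schedule can be fixed before any token arrives.
print(len(dense_plans))
# Several distinct plans: the schedule is only known at run time.
print(len(moe_plans))
```

In this toy setup the dense model yields a single plan for all three tokens, while the routed model yields a different plan for each, which is the property the architects say works against a fully static schedule.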

That narrowing does not make the LPU a bad business. It changes the question to what the business actually is. Three possibilities are on the table, and they have very different consequences for the rest of the inference chip market. One, the LPU becomes a permanent specialized role inside the AI stack, sitting alongside Google's TPU, Anthropic's compute-in-memory research, SambaNova's CPU+GPU+RDU systems, and Cerebras' wafer-scale chip, with each vendor owning a workload-shaped lane. Two, the LPU slot stays at the platform level but NVIDIA gradually internalizes the architecture, in which case Groq is a feature on someone else's roadmap, paid for only as long as the integration is convenient. Three, the LPU category fragments the way networking did, into routers, switches, and SmartNICs, with Groq owning the equivalent of a SmartNIC niche and other vendors crowding in.

The watch item is the next compiler release. If Groq's static scheduling advantage survives the move to larger mixture-of-experts models, the LPU slot inside Vera Rubin is structural. If NVIDIA's own compiler and a future Rubin-generation SRAM configuration close the bandwidth gap, the slot becomes a transitional feature, and the independent LPU companies, including the Chinese startups the Leiphone piece says are building domestic variants of the same dataflow and SRAM design, will have to find a lane NVIDIA is not already serving. The Groq 3 LPX is the first platform-level test of which way that breaks.

Sources