The AI Inference Stack's Center of Gravity Is Moving Off the GPU

The polite fiction of enterprise AI has long been that serious inference requires a uniform fleet of top-tier accelerators. An experiment from IBM Research, Red Hat, and India's NxtGen Cloud Technologies is the first credible data point that says otherwise: the trio reports 3–5x inference speedups and roughly 2x throughput on a single cluster that pools GPUs from three different vendors, running the open-source inference project llm-d.

The result is not a hardware story. It is a control story. The piece of software doing the work is llm-d, a distributed inference orchestrator that sits on top of established serving engines like vLLM and SGLang, allocates requests across machines, and decides which chunks of model work run on which hardware. The trick the team is betting on is a cache-aware router. By tracking the KV cache (the pre-computed intermediate state the model reuses so it does not have to redo work on every request), llm-d can split a single conversation between, say, a high-end accelerator for the prompt and a cheaper older card for the response generation, then keep both fed without re-priming the cache from scratch.
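To make the routing idea concrete, here is a minimal Python sketch of prefix-cache-aware routing with separate prefill and decode pools. It is not llm-d's code or API: `Worker`, `route`, `block_hashes`, and the block size of 4 are all hypothetical names and values chosen for illustration, and the sketch assumes token IDs are plain integers. It also leaves out the step that moves cached state from the prefill machine to the decode machine.

```python
from dataclasses import dataclass, field

BLOCK = 4  # tokens per cache block (toy value, an assumption)

def block_hashes(tokens):
    """Chain-hash each full block so a hash identifies the entire prefix up to it."""
    hashes, prev = [], 0
    full = len(tokens) - len(tokens) % BLOCK  # ignore a trailing partial block
    for i in range(0, full, BLOCK):
        prev = hash((prev, tuple(tokens[i:i + BLOCK])))
        hashes.append(prev)
    return hashes

@dataclass
class Worker:
    name: str
    role: str                                 # "prefill" or "decode"
    cached: set = field(default_factory=set)  # block hashes this worker holds
    load: int = 0                             # requests routed here so far

def cached_prefix_blocks(worker, hashes):
    """Count how many leading blocks of the request are already cached."""
    n = 0
    for h in hashes:
        if h not in worker.cached:
            break
        n += 1
    return n

def route(workers, tokens, role):
    """Pick the worker of the given role with the longest cached prefix,
    breaking ties in favor of the least-loaded worker."""
    hashes = block_hashes(tokens)
    candidates = [w for w in workers if w.role == role]
    best = max(candidates,
               key=lambda w: (cached_prefix_blocks(w, hashes), -w.load))
    best.cached.update(hashes)  # the chosen worker now holds this prefix
    best.load += 1
    return best
```

In this toy version, a follow-up turn in a conversation shares its leading blocks with the earlier turn, so it lands on the worker that already holds them rather than on an idle one. Chaining the hashes is the design choice that matters: a block only counts as a hit if every block before it also matches.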

That is the technical move. The institutional move is the bigger one. IBM has donated llm-d to the Cloud Native Computing Foundation, the same body that governs Kubernetes. The donation reframes the project from an IBM product into shared infrastructure, with the kind of multi-vendor governance that Kubernetes itself used to make Linux containers the boring default of the cloud.

That is why the speedup number matters in a different way than the wire framing suggests. The interesting economic question is not "is it really 3–5x faster?" but "what happens to GPU pricing power when the routing layer is open, community-governed, and indifferent to which vendor's silicon sits underneath?" If the answer is that the orchestrator captures the value the accelerator vendor used to capture, then the assumption that an enterprise must buy homogeneous top-bin GPU fleets becomes optional, at least experimentally. Procurement teams that mix hardware generations, lease older accelerators, or pull silicon from different vendors can now plausibly run production-scale LLM serving on whatever they happen to own, as long as the routing layer holds the latency guarantee.

The team is careful to call out what is and is not yet settled. The 3–5x figure is a vendor-reported result on the project's own code path, run by its creators and a partner cloud provider. It is not an independent MLPerf-style cross-vendor benchmark. The technical friction is real, and the team names it plainly: divergent driver stacks across vendors have to be reconciled, container runtimes have to behave, and in-flight requests have to be drained and rescheduled without breaking latency SLOs (the service-level objectives that translate to "responses must arrive within X milliseconds to end users"). None of that is a footnote. It is the actual engineering problem llm-d is trying to make boring.
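The drain-and-reschedule problem can be illustrated with a small Python sketch. This is a greedy toy under stated assumptions, not how llm-d handles it: `Request`, `Node`, `drain`, the per-token speeds, and the model of each node as a single serial queue are all hypothetical simplifications.

```python
from dataclasses import dataclass

@dataclass
class Request:
    rid: str
    tokens_left: int    # tokens still to generate
    deadline_ms: float  # remaining time budget before the SLO is missed

@dataclass
class Node:
    name: str
    ms_per_token: float   # assumed decode speed of this hardware
    queue_ms: float = 0.0 # work already queued, modeled as one serial queue

def drain(requests, targets):
    """Move requests off a node being drained.

    Greedy policy: handle the most urgent request first and place it on the
    target that would finish it earliest, but only if that meets its deadline.
    Returns (placements by request id, ids that would miss their SLO).
    """
    placed, missed = {}, []
    for r in sorted(requests, key=lambda r: r.deadline_ms):
        def finish(n, r=r):
            return n.queue_ms + r.tokens_left * n.ms_per_token
        best = min(targets, key=finish)
        done_at = finish(best)
        if done_at <= r.deadline_ms:
            placed[r.rid] = best.name
            best.queue_ms = done_at  # the node is now busy until then
        else:
            missed.append(r.rid)
    return placed, missed
```

Even in this simplified form, the sketch shows why mixed hardware makes the problem harder: a slower card can be the right destination once the fast one has a queue, and some requests cannot be saved by any placement.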

The sovereign-AI framing (keeping data and models on hardware the enterprise owns or leases, usually for cost, compliance, or jurisdictional control) is the obvious commercial hook. It is also where the bet has its most obvious customers: buyers in defense procurement, EU digital-sovereignty programs, and regulated banking are plausible candidates for exactly this kind of procurable, inspectable, vendor-neutral inference layer. The CNCF donation is the institutional signal that llm-d wants to be the default for those buyers, not an IBM product they would have to justify.

The falsifier is independent replication. If a third party, such as an academic lab, a hyperscaler benchmarking team, or a future MLPerf submission, runs heterogeneous GPU clusters through llm-d or a comparable orchestrator and gets materially worse numbers, the procurement argument weakens. If the numbers hold, the inference stack's center of gravity migrates from the accelerator vendor to the routing layer, and the next decade of AI infrastructure value lives wherever that routing layer is governed.

For now, the direction of travel is set. Whether the bet pays off depends on whether the orchestration code is as portable, as governed, and as reliably boring as the project positioning claims.
