The Cloud Native Computing Foundation accepted llm-d into its sandbox at KubeCon Europe 2026 in Amsterdam this week — and the most interesting thing about the donation is the hand that made it.
Red Hat, Google Cloud, IBM Research, CoreWeave, and NVIDIA co-founded llm-d in May 2025 as a Kubernetes-native inference stack. By last month they had AMD, Cisco, Hugging Face, Intel, Lambda, and Mistral AI alongside them. Now they are handing that work to a vendor-neutral foundation and calling it a win for the ecosystem. The governance move is the story, not the code.
The timing is not accidental. Disaggregated inference — splitting the prefill phase (compute-bound, processes input tokens) from the decode phase (memory-bound, generates output tokens) into independently scalable pods — has become the dominant architectural pattern for serving large language models at scale. llm-d implements this pattern on Kubernetes, using the Endpoint Picker (EPP) as a primary implementation of the Kubernetes Gateway API Inference Extension (GAIE), giving platform teams a programmable routing layer with prefix-cache awareness built in. That is real infrastructure. It is also exactly the kind of plumbing that benefits from being vendor-neutral the moment it starts mattering to more than one company.
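The routing idea is easy to sketch. The snippet below is an illustrative toy of prefix-cache-aware endpoint picking, not the actual EPP plugin API: it scores each backend by how many leading blocks of the prompt it already holds in KV cache, discounted by queue depth, and sends the request to the best score. All names, the block size, and the weights are assumptions for the sketch.

```python
from dataclasses import dataclass, field

BLOCK_SIZE = 16  # tokens per KV-cache block (illustrative)

@dataclass
class Endpoint:
    name: str
    cached_blocks: set = field(default_factory=set)  # hashes of cached prefix blocks
    queue_depth: int = 0

def prefix_blocks(tokens):
    """Hash the prompt into fixed-size prefix blocks, chained so each
    block's hash depends on everything before it (as prefix caches do)."""
    blocks, h = [], 0
    for i in range(0, len(tokens) - len(tokens) % BLOCK_SIZE, BLOCK_SIZE):
        h = hash((h, tuple(tokens[i:i + BLOCK_SIZE])))
        blocks.append(h)
    return blocks

def pick(endpoints, tokens, cache_weight=2.0, queue_weight=1.0):
    """Return the endpoint with the best cache-locality / load tradeoff."""
    blocks = prefix_blocks(tokens)
    def score(ep):
        hits = 0
        for b in blocks:  # count only the *contiguous* cached prefix
            if b in ep.cached_blocks:
                hits += 1
            else:
                break
        return cache_weight * hits - queue_weight * ep.queue_depth
    return max(endpoints, key=score)
```

The point of the toy is the tradeoff itself: a router that knows where prefixes live keeps hit rates high, but only if it also backs off from hot replicas, which is why the score subtracts queue depth rather than chasing cache hits unconditionally.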
Carlos Costa, a distinguished engineer at IBM Research specializing in hybrid cloud platforms for AI, framed it this way: the disaggregation architecture lets IT teams dial prefill performance up or down independently of decode capacity. Instead of a single monolithic inference service, llm-d splits prefill and decode into independently scalable pods. The knobs exist because the phases have different hardware profiles: prefill is compute-bound and parallelizes across the input tokens, while decode is memory-bandwidth-bound, produces one token per step per sequence, and benefits from batching. Treating the two as a single monolith wastes both kinds of capacity.
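Costa's "dial" can be made concrete with back-of-the-envelope sizing. The arithmetic below is illustrative, under assumed per-pod throughput figures rather than measured llm-d numbers: because prefill load scales with input tokens and decode load with output tokens, the two replica counts fall out of two independent divisions, and a workload shift can grow one pool without touching the other.

```python
import math

def required_replicas(requests_per_s, in_tokens, out_tokens,
                      prefill_tok_per_s_per_pod, decode_tok_per_s_per_pod):
    """Size the prefill and decode pools independently.

    Prefill demand scales with *input* tokens (compute-bound, parallel over
    the prompt); decode demand scales with *output* tokens (memory-bandwidth
    bound, one token per step per sequence). Throughput figures are
    assumptions for the sketch.
    """
    prefill = math.ceil(requests_per_s * in_tokens / prefill_tok_per_s_per_pod)
    decode = math.ceil(requests_per_s * out_tokens / decode_tok_per_s_per_pod)
    return prefill, decode

# A chat workload: long pasted context, short answers.
chat = required_replicas(50, in_tokens=2000, out_tokens=150,
                         prefill_tok_per_s_per_pod=40_000,
                         decode_tok_per_s_per_pod=5_000)   # -> (3, 2)

# A burst that doubles input length but not output length:
# only the prefill pool needs to grow.
burst = required_replicas(50, in_tokens=4000, out_tokens=150,
                          prefill_tok_per_s_per_pod=40_000,
                          decode_tok_per_s_per_pod=5_000)  # -> (5, 2)
```

A monolithic deployment facing the same burst would have to scale pods sized for both phases at once, which is the waste the disaggregated design avoids.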
Google Cloud has published benchmark data from its GKE Inference Gateway running llm-d's EPP. On Vertex AI, time-to-first-token latency dropped 35% for Qwen Coder and P95 tail latency improved 52% for DeepSeek on bursty chat workloads. Prefix cache hit rate doubled from 35% to 70%. The numbers are real, from a production deployment, and they track with what disaggregated serving should do — better cache locality on prefix-heavy workloads, more efficient GPU utilization when the routing layer knows where to send the request. AWS released a container this month, ghcr.io/llm-d/llm-d-aws, with Elastic Fabric Adapter, libfabric, and NIXL integration for multi-node disaggregated inference. That is also real — a hyperscaler shipping production infrastructure for a project that six months ago was five companies' internal tooling.
Priya Nagpurkar, vice president of AI platform at IBM Research, and Vita Bortnikov, an IBM Fellow specializing in distributed AI inferencing at IBM Research, wrote the announcement post. Brian Stevens, Red Hat's senior vice president and AI chief technology officer, and Robert Shaw, director of engineering at Red Hat, gave interviews. The contributors are named. The repository is public. This is the part that usually looks like an announcement and nothing else — and it is an announcement. But announcements do not usually come with a governance hand-off attached.
That hand-off is where the CNCF move becomes worth watching. Kubernetes is the substrate, the Gateway API is the control plane, and GAIE is the extension point making inference routing composable with the rest of the cloud-native stack. When Red Hat, Google, IBM, CoreWeave, and NVIDIA agree that the right home for this work is a foundation where none of them has a veto, it signals something about the maturity of the problem they are solving. Inference disaggregation has graduated from research pattern to production concern. The companies that built llm-d are now betting that a neutral steward makes the stack more widely adoptable than a proprietary fork would be.
The bet is not without precedent in the CNCF ecosystem, and it is not without risk. Projects donated to foundations can slow down when governance processes mature, or speed up when commercial interests align with the roadmap. llm-d's founding contributors are explicit that they want to avoid fragmentation, the same concern that drove Kubernetes itself into the CNCF in 2015. Whether that concern translates into coherent governance or into a vendor-neutral shell with a commercially captured roadmap is the open question for the project.
The integration surface is already spreading: KServe v0.16 ships the LLMInferenceService, which wraps llm-d's primitives for model serving. The CNCF sandbox designation means the project gets the foundation's brand and legal infrastructure, not its guarantee of longevity. Sandbox projects have graduated, and sandbox projects have stalled. What the donation buys llm-d is time: the ability to build contributor diversity and community before any single vendor's commercial interests become the only voice in the room.
For builders and investors, the relevant read is this: disaggregated inference is the production architecture for serving LLMs at scale, and llm-d is the closest thing the Kubernetes ecosystem has to a canonical implementation. The CNCF move is a signal that the founding companies want it to become infrastructure in the formal sense — something that outlives their individual product cycles. Whether that signal is genuine or strategic depends on who shows up to the technical oversight committee. Watch that body. It is where the real architecture decisions will get made.