When we launched llm-d in May 2025, we set out to bridge the enormous capability gap between AI experimentation and large-scale, mission-critical production inference. By contributing llm-d to the CNCF, we are advancing the goal of a multi-vendor coalition—including CoreWeave, IBM, Google, and NVIDIA—to establish the open standard for distributed inference.
Inference drives the agentic era
As we move further into an agentic future, the AI inference that powers enterprise agents across industries is poised for widespread adoption. It will be critical that the cost and complexity of inference do not outweigh the business value of the agents themselves. However, inference can be extremely expensive, consuming large amounts of specialized accelerator capacity, and these costs escalate further at scale. llm-d's advanced capabilities directly address this, meeting enterprise Service Level Objectives (SLOs) while maximizing infrastructure efficiency. Furthermore, organizations need the flexibility to deploy inference wherever it makes sense—data center, cloud, or edge—on the hardware of their choice. This flexibility is only possible if the underlying ecosystem is built on open source and open standards.
Closing the gap in the cloud-native environment
Although Kubernetes is the industry standard for orchestration, it wasn't originally designed for the unique, stateful demands of large language model (LLM) inference. In a traditional microservice, a request is a request: each replica can handle it equally well. In generative AI, the cost of a request varies enormously depending on the length of the input and output tokens, the size and architecture of the model, cache locality, and whether the request is in the prefill (compute-bound) or decode (memory-bound) phase.
Standard service routing is blind to these dynamics, leading to inefficient allocation and unpredictable latency. This is where llm-d bridges the gap. It functions as a specialized data plane orchestration layer between high-level control planes like KServe and low-level engines like vLLM. By leveraging native Kubernetes building blocks such as Gateway API and LeaderWorkerSet (LWS), it transforms complex distributed inference into a manageable and observable cloud-native workload.
Strengthening the ecosystem through contribution
By making llm-d available to the CNCF, we are establishing well-defined pathways—proven and replicable designs that transform fragmented AI components into modular and interoperable microservices. This contribution is more than just a single project; it's about enriching the entire cloud-native landscape so that inference becomes an integral part of the same environment as traditional container-based applications.
A central part of this work is the endpoint picker (EPP). llm-d serves as a key implementation of the Gateway API Inference Extension (GAIE), and the EPP enables programmable, inference-aware routing. This means the system makes routing decisions based on the actual state of each engine, optimizing for KV cache hit rates and accelerator utilization. This is a critical requirement for sustaining performance under strict service-level objectives.
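To make the idea of inference-aware routing concrete, here is a minimal sketch of an endpoint-scoring function in the spirit of an endpoint picker. It is an illustration only, not llm-d's actual algorithm: the `Endpoint` fields, weights, and scoring formula are assumptions chosen for clarity.

```python
from dataclasses import dataclass

@dataclass
class Endpoint:
    name: str
    queue_depth: int           # requests already waiting in the engine
    cached_prefix_tokens: int  # tokens of this prompt already in the KV cache

def pick_endpoint(endpoints, prompt_tokens, cache_weight=2.0, queue_weight=1.0):
    """Score each replica: reward KV cache reuse, penalize queueing."""
    def score(ep):
        hit_ratio = min(ep.cached_prefix_tokens, prompt_tokens) / max(prompt_tokens, 1)
        return cache_weight * hit_ratio - queue_weight * ep.queue_depth
    return max(endpoints, key=score)

replicas = [
    Endpoint("pod-a", queue_depth=4, cached_prefix_tokens=0),
    Endpoint("pod-b", queue_depth=1, cached_prefix_tokens=900),
]
best = pick_endpoint(replicas, prompt_tokens=1000)
print(best.name)  # pod-b: shorter queue and a warm prefix cache
```

A plain round-robin balancer would treat both replicas as interchangeable; a scorer like this one sends the request where most of its prompt is already cached, avoiding redundant prefill work.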
llm-d complements and extends the existing set of solutions within the CNCF:
● Kubernetes: Provides the key infrastructure platform for AI workloads.
● Gateway API: Drives upstream alignment for AI-specific routing, ensuring that traffic management remains an open core component.
● KServe: Acts as the high-level control plane that integrates with llm-d to support advanced features such as disaggregated serving and prefix caching.
● LeaderWorkerSet: Leverages native Kubernetes building blocks to orchestrate complex multi-node replication and expert parallelism, transforming engines like vLLM into manageable cloud-native workloads.
● Prometheus & Grafana: Exports specialized metrics such as time to first token (TTFT) to bring enterprise observability to generative AI.
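The metrics above can be derived from per-token timestamps. The sketch below shows how time to first token (TTFT) and mean inter-token latency might be computed for a single request; the function name and return keys are illustrative, not llm-d's actual exporter.

```python
def latency_metrics(request_start, token_times):
    """Derive generation latency metrics from per-token arrival timestamps.

    request_start: time the request was received (seconds)
    token_times:   monotonically increasing arrival time of each output token
    """
    if not token_times:
        raise ValueError("no tokens generated")
    ttft = token_times[0] - request_start  # time to first token
    gaps = [b - a for a, b in zip(token_times, token_times[1:])]
    itl = sum(gaps) / len(gaps) if gaps else 0.0  # mean inter-token latency
    return {"ttft_seconds": ttft, "itl_seconds": itl}

m = latency_metrics(0.0, [0.25, 0.30, 0.35, 0.40])
print(m)  # TTFT of 0.25 s, mean inter-token latency of ~0.05 s
```

TTFT captures the user-visible prefill delay, while inter-token latency tracks decode throughput—two distinct SLO signals that generic HTTP request-duration metrics collapse into one number.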
Building the future of inference, together
Collaboration has been at the heart of llm-d since its inception. When we announced llm-d last year at the Red Hat Summit, the combined efforts of the project's founding contributors, industry leaders, and academics were a source of pride for Red Hat, not only for launching llm-d but also for establishing a collaborative and future-proof foundation. In the 10 months since then, llm-d has been adopted for both private enterprise AI MaaS (Model-as-a-Service) and large-scale AI initiatives. More importantly, the project's open roots continue to deepen with a growing ecosystem of contributors and partners. Developers and enterprises are placing their trust in llm-d, and making the project available to the CNCF will support and sustain an open future. The path to successful open-source AI innovation is long, but together we are building the infrastructure to get there.
By Brian Stevens, Senior Vice President and Chief Technology Officer (CTO) for AI, Red Hat
