llm-d has been officially accepted as a CNCF Sandbox project. The move places the project under neutral governance within the Linux Foundation and advances its goal of an open standard for AI inference across any accelerator and any cloud environment.
The Cloud Native Computing Foundation (CNCF) has accepted llm-d as an official Sandbox project. This places the distributed inference framework under the neutral governance of the Linux Foundation, giving organizations the assurance that they are building on a vendor-neutral, open foundation. llm-d was launched in May 2025 as a joint initiative of Red Hat, Google Cloud, IBM Research, CoreWeave, and Nvidia, with one clear vision: any model, any accelerator, any cloud.
Since then, AMD, Cisco, Hugging Face, Intel, Lambda, and Mistral AI have joined as partners. The University of California, Berkeley, and the University of Chicago, well-known names in the vLLM and LMCache communities, also support the project. With its CNCF admission, llm-d now gains the governance structure and open leadership that companies need in order to build on it seriously.
Kubernetes-native inference as a first-class workload
The project addresses a specific bottleneck: AI serving is stateful and latency-sensitive, while traditional service routing and autoscaling are blind to these factors. This leads to inefficient placement, cache fragmentation, and unpredictable latency. llm-d tackles this by serving as the primary implementation of the Kubernetes Gateway API Inference Extension (GAIE) and providing inference-aware traffic management via the Endpoint Picker (EPP).
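To make "inference-aware routing" concrete: unlike a round-robin load balancer, an endpoint picker can weigh how much of a prompt's KV cache a pod already holds against how busy that pod is. The sketch below is a toy illustration of that idea, not llm-d's actual EPP API; the class, field names, and scoring weights are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Endpoint:
    name: str
    queue_depth: int       # requests currently waiting on this pod (hypothetical signal)
    cache_hit_prefix: int  # prompt tokens already in this pod's KV cache (hypothetical signal)

def pick_endpoint(endpoints: list[Endpoint], prompt_tokens: int) -> Endpoint:
    """Score each pod by KV-cache reuse minus a load penalty; return the best.

    A plain load balancer would only look at queue_depth; an inference-aware
    picker also rewards pods that can skip recomputing part of the prefill.
    """
    def score(ep: Endpoint) -> float:
        reuse = ep.cache_hit_prefix / max(prompt_tokens, 1)  # 0..1 cache affinity
        load_penalty = 0.1 * ep.queue_depth                  # illustrative weight
        return reuse - load_penalty
    return max(endpoints, key=score)

endpoints = [
    Endpoint("pod-a", queue_depth=4, cache_hit_prefix=0),
    Endpoint("pod-b", queue_depth=1, cache_hit_prefix=512),
]
# pod-b wins: half the 1024-token prompt is already cached there.
best = pick_endpoint(endpoints, prompt_tokens=1024)
```

With a cache-blind policy, pod-b's slight queue would be its only signal; here the cached 512-token prefix dominates, avoiding redundant prefill work.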
Additionally, the framework offers prefill/decode disaggregation: prompt processing (prefill) and token generation (decode) are split into separately scalable pods. Hierarchical KV cache offloading distributes memory load across GPU, CPU, and storage. The latest v0.5 release demonstrates near-zero latency in a multi-tenant SaaS scenario while scaling to roughly 120,000 tokens per second.
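The idea behind hierarchical KV cache offloading can be sketched as a tiered cache: a small, fast "GPU" tier evicts least-recently-used blocks to a larger "CPU" tier instead of discarding them, so a later request with the same prefix can promote them back rather than recompute the prefill. This is a minimal stand-in, not llm-d's or LMCache's implementation; real systems manage device memory, paging, and a storage tier below the CPU.

```python
from collections import OrderedDict

class TieredKVCache:
    """Toy two-tier KV cache (hypothetical structure for illustration).

    The 'gpu' tier holds at most gpu_capacity blocks; LRU victims are
    demoted to the 'cpu' tier and promoted back on reuse.
    """
    def __init__(self, gpu_capacity: int):
        self.gpu = OrderedDict()  # block_id -> KV data (stand-in for tensors)
        self.cpu = {}             # overflow tier
        self.gpu_capacity = gpu_capacity

    def put(self, block_id, kv):
        self.gpu[block_id] = kv
        self.gpu.move_to_end(block_id)            # mark most recently used
        while len(self.gpu) > self.gpu_capacity:
            victim, data = self.gpu.popitem(last=False)  # evict LRU block
            self.cpu[victim] = data                      # offload, don't discard

    def get(self, block_id):
        if block_id in self.gpu:
            self.gpu.move_to_end(block_id)
            return self.gpu[block_id]
        if block_id in self.cpu:                  # hit in the slower tier:
            kv = self.cpu.pop(block_id)
            self.put(block_id, kv)                # promote back to GPU
            return kv
        return None  # true miss: the prefill must be recomputed
```

The payoff is that a "miss" in GPU memory costs a CPU-to-GPU copy instead of a full prefill recomputation, which is the trade hierarchical offloading exploits.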
Avoiding vendor lock-in is a core principle. Through model- and state-aware routing policies, llm-d directs requests to the most suitable hardware from Nvidia, AMD, or Google, improving metrics such as Time to First Token (TTFT) and token throughput. The project also aims to become the standard for open, reproducible inference benchmarks.
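For reproducible benchmarks, the two headline metrics mentioned above have simple definitions: TTFT is the delay from request start until the first token arrives, and throughput is tokens delivered per unit time. A minimal sketch (hypothetical helper, not part of llm-d's benchmark tooling) computing both from a stream of token arrival timestamps:

```python
def stream_metrics(token_times: list[float], start: float) -> dict:
    """Compute Time to First Token and token throughput.

    token_times: absolute timestamps (seconds) at which each token arrived,
    in order; start: the timestamp at which the request was sent.
    """
    ttft = token_times[0] - start               # Time to First Token
    duration = token_times[-1] - start          # total streaming time
    throughput = len(token_times) / duration if duration > 0 else 0.0
    return {"ttft_s": ttft, "tokens_per_s": throughput}
```

Routing policies like llm-d's aim to shrink `ttft_s` (by avoiding redundant prefill) and raise `tokens_per_s` (by balancing decode load) at the same time.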