During KubeCon, Microsoft announced support for Retrieval-Augmented Generation (RAG) in KAITO on Azure Kubernetes Service (AKS) clusters. In addition, vLLM is now the standard serving engine in the AI toolchain operator add-on.
Adding RAG support to KAITO is an important step for developers who want to implement advanced search capabilities on their AKS clusters. With this feature, users can deploy the RAG engine within minutes, together with a supported embedding model, to index and search large datasets. The engine connects to a model through a KAITO inference service URL.
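As a rough sketch, deploying the RAG engine comes down to applying a `RAGEngine` custom resource that points at an embedding model and an existing KAITO inference service. The field names below follow the KAITO `v1alpha1` API as documented at the time of writing and may change; the instance type, model ID, and URL are placeholder assumptions, not values from the announcement.

```yaml
# Hypothetical RAGEngine manifest (KAITO v1alpha1 API; verify fields against
# the KAITO documentation before use).
apiVersion: kaito.sh/v1alpha1
kind: RAGEngine
metadata:
  name: ragengine-example
spec:
  compute:
    instanceType: "Standard_D8s_v3"        # CPU node for the RAG service (assumption)
    labelSelector:
      matchLabels:
        apps: ragengine-example
  embedding:
    local:
      modelID: "BAAI/bge-small-en-v1.5"    # example supported embedding model (assumption)
  inferenceService:
    url: "http://workspace-inference-svc/v1/completions"  # KAITO inference service URL (placeholder)
```

Once the resource is reconciled, documents can be indexed and queried against the service; the exact indexing API is described in the KAITO RAG documentation.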
Higher processing speed with vLLM
Another improvement is that the AI toolchain operator add-on now runs model inference workloads on the vLLM serving engine by default. According to Microsoft, this engine processes incoming requests significantly faster. It also lets developers use OpenAI-compatible APIs, DeepSeek R1 models, and various pre-trained Hugging Face models.
For developers who prefer Hugging Face Transformers over vLLM, Microsoft offers the option to switch between these engines at any time for KAITO inference deployments.
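In KAITO, the runtime switch is expressed on the workspace itself. The sketch below assumes the `kaito.sh/runtime` annotation documented by the KAITO project; the API version, preset name, and VM size are illustrative placeholders.

```yaml
# Hypothetical Workspace manifest pinning the Transformers runtime
# (check the KAITO docs for the current API version and annotation).
apiVersion: kaito.sh/v1alpha1
kind: Workspace
metadata:
  name: workspace-falcon-7b
  annotations:
    kaito.sh/runtime: "transformers"   # omit, or set "vllm", for the default engine
resource:
  instanceType: "Standard_NC12s_v3"    # GPU VM size (assumption)
  labelSelector:
    matchLabels:
      apps: falcon-7b
inference:
  preset:
    name: falcon-7b-instruct           # example KAITO model preset (assumption)
```

Because the choice is a per-workspace annotation rather than a cluster-wide setting, teams can run vLLM and Transformers workspaces side by side and switch an individual deployment at any time.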
Customized GPU driver installation
The third update concerns skipping automatic GPU driver installation, a capability that is now generally available. By default, AKS installs NVIDIA GPU drivers when a node pool is created with a VM size that supports NVIDIA GPUs. With this new option, users can choose to install custom GPU drivers themselves or to use the GPU Operator, on both Linux and Windows node pools.
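In practice, this is an option on node pool creation. The sketch below assumes the `--gpu-driver none` flag on `az aks nodepool add`; flag names evolve between CLI releases, so verify with `az aks nodepool add --help`. Resource group, cluster, and VM size values are placeholders.

```
# Create a GPU node pool without the automatic NVIDIA driver install
# (flag name assumed; confirm against current Azure CLI docs).
az aks nodepool add \
  --resource-group myResourceGroup \
  --cluster-name myAKSCluster \
  --name gpunp \
  --node-count 1 \
  --node-vm-size Standard_NC6s_v3 \
  --gpu-driver none
```

After the pool is up, drivers can then be supplied manually or by deploying NVIDIA's GPU Operator, which manages drivers and the container toolkit from inside the cluster.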
Tip: Microsoft significantly expands Azure Kubernetes Service