Databricks has released a preview of GPU and LLM optimization for Model Serving, which the company hopes will make deploying large AI models on the Lakehouse Platform easier.
The new functionality automatically optimizes LLM serving and delivers high performance without requiring manual configuration.
Databricks describes the functionality as the first serverless GPU offering built on a unified data and AI platform. This should enable end users to develop generative AI solutions seamlessly within a single platform, from data ingestion to model deployment and monitoring.
The functionality allows users to deploy a wide range of AI models, including natural language, computer vision, audio, tabular, and custom models. According to Databricks, it does not matter how the models were trained or on what type of data.
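As an illustration, GPU-backed endpoints of this kind are created through the Databricks serving-endpoints REST API (`POST /api/2.0/serving-endpoints`). The sketch below builds such a request body in Python; the endpoint and model names are hypothetical, and the exact workload types available may differ per workspace:

```python
import json

def build_endpoint_request(endpoint_name, model_name, model_version):
    """Build a request body for creating a GPU model-serving endpoint.

    The payload shape follows the Databricks serving-endpoints REST API;
    names and sizes here are illustrative placeholders.
    """
    return {
        "name": endpoint_name,
        "config": {
            "served_models": [
                {
                    "model_name": model_name,        # registered MLflow model
                    "model_version": model_version,  # version in the registry
                    "workload_type": "GPU_MEDIUM",   # GPU-backed serving
                    "workload_size": "Small",
                    "scale_to_zero_enabled": True,   # idle endpoints cost nothing
                }
            ]
        },
    }

# Hypothetical endpoint and model names, for illustration only.
payload = build_endpoint_request("llama2-chat", "llm.llama2_7b", "1")
print(json.dumps(payload, indent=2))
```

The body would then be sent with an authenticated POST to the workspace's serving-endpoints endpoint; the optimization described in the article is applied automatically for supported model families.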
Reduced latency and costs
LLMs deployed via Model Serving are said to achieve up to 3.5 times lower latency at correspondingly lower cost, along with up to 2.5 times higher throughput.
In the preview, Databricks Model Serving's GPU and LLM optimization automatically optimizes MPT and Llama 2 models. Support for additional models will follow.