Running LLMs outside a datacenter is usually not a realistic prospect. Nvidia and Mistral are letting PC users run a new model that does work locally.
Mistral NeMo 12B is the name of the new AI model, presented this week by Nvidia and Mistral. “We are fortunate to collaborate with the NVIDIA team, leveraging their top-tier hardware and software,” said Guillaume Lample, cofounder and chief scientist of Mistral AI. “Together, we have developed a model with unprecedented accuracy, flexibility, high-efficiency and enterprise-grade support and security thanks to NVIDIA AI Enterprise deployment.”
The promise of the new AI model is significant. Whereas previous LLMs were tied to datacenters, Mistral NeMo 12B is meant to run on workstations, and to do so without sacrificing performance. At least, that is the promise.
Stumbling blocks
The main stumbling block is not that laptops, desktops and even workstations lack the raw compute for AI; speed isn't really the core issue. The shortcoming is the amount of video memory available. A PC with a discrete GPU usually has 4, 6 or 8 GB at its disposal, with outliers of 12, 16 or 24 GB for gaming or productivity purposes. That's not enough for the full-fledged versions of Meta's Llama 3 or its alternatives, which require more than 80 GB.
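To see why video memory is the bottleneck, consider that the weights alone of a model scale with parameter count and numeric precision. A back-of-the-envelope sketch (the parameter counts and byte sizes below are illustrative assumptions, and real inference needs additional memory for activations and the KV cache on top of the weights):

```python
# Rough VRAM needed just to hold a model's weights, by parameter count
# and numeric precision. These are back-of-the-envelope figures, not
# vendor specifications.
def weight_memory_gb(params_billions: float, bytes_per_param: float) -> float:
    return params_billions * 1e9 * bytes_per_param / 1e9

for name, params in [("Mistral NeMo 12B", 12), ("Llama 3 70B", 70)]:
    for precision, size in [("fp16", 2), ("int8", 1), ("int4", 0.5)]:
        gb = weight_memory_gb(params, size)
        print(f"{name} @ {precision}: ~{gb:.0f} GB")
```

At 16-bit precision, a 12-billion-parameter model already needs roughly 24 GB for its weights, which explains why only the largest consumer and workstation GPUs come into the picture.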
However, users on the subreddit r/LocalLLaMA have been doing it for months: running AI models on their own PCs that really shouldn't fit. Normally, models from OpenAI, Meta, Google and their competitors have hardware requirements that can only be met in a cloud environment. The trick these Reddit users rely on is called quantization, which allows open-source models to run on lesser hardware. Quantization "blurs" the parameters of an AI model by storing them at lower numerical precision, making the model less accurate. As a result, there is a greater chance of AI hallucinations, or wrong answers.
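A minimal sketch of the idea, using symmetric per-tensor int8 quantization in NumPy. A toy 4×4 matrix stands in for one layer of an LLM; real tools use more sophisticated schemes (per-channel scales, 4-bit formats), but the principle is the same: trade precision for memory.

```python
import numpy as np

# Toy fp32 weight matrix standing in for one layer of an LLM.
weights = np.random.default_rng(0).normal(size=(4, 4)).astype(np.float32)

def quantize_int8(w):
    """Symmetric per-tensor quantization: map fp32 weights to int8."""
    scale = np.abs(w).max() / 127.0          # one scale for the whole tensor
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    """Recover an fp32 approximation of the original weights."""
    return q.astype(np.float32) * scale

q, scale = quantize_int8(weights)
restored = dequantize(q, scale)

# int8 needs a quarter of the memory of fp32...
print(weights.nbytes, q.nbytes)              # 64 bytes vs 16 bytes
# ...at the cost of a small rounding error per weight: the "blurring".
print(np.max(np.abs(weights - restored)))
```

Each weight is now off by up to half a quantization step, and those small errors accumulate across billions of parameters, which is where the accuracy loss comes from.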
Small models, big problems
For enterprise purposes, GenAI has yet to prove itself. A go-to solution from cloud vendors is to provide AI offerings that ensure data privacy. This can be done in various ways, through private clouds or, for example, strong security measures on data in use, but nothing beats a local workload: it is the only approach that can be secured in a familiar, relatively simple way.
Mistral NeMo 12B comes as an Nvidia NIM microservice and is, unsurprisingly, optimized for Nvidia hardware. This containerized deployment makes it flexible and fast to set up. "Models can be deployed anywhere in minutes instead of days," Nvidia says.
That said, deployment remains relatively limited. Only the most expensive workstations possess the required Nvidia L40S, RTX 4090 or RTX 4500 GPUs. The cheapest of these, the RTX 4090, can be bought separately for just under 1,700 euros.