Following hot on the heels of Microsoft, Nvidia has also released a smaller AI model that can run locally on devices with less computing power. The Mistral-NeMo-Minitron 8B model is a scaled-down version of an earlier model developed in collaboration with French AI startup Mistral. The secret behind it lies in two techniques: ‘pruning’ and ‘distilling’.
According to Kari Briski, head of Nvidia’s AI and HPC division, this model is small enough to run on RTX workstations. At the same time, it is powerful enough to score well on benchmarks for AI chatbots, virtual assistants, content generators, and educational tools. It is also said to be suitable for laptops and edge devices. In other words, not every task calls for a giant LLM.
Two techniques were employed to keep the model small but still sufficiently effective. They reduced a larger model, the 12-billion-parameter Mistral NeMo 12B (itself only a month old), to a considerably more manageable 8 billion parameters. With ‘pruning,’ Nvidia removed the parts of the neural network, specifically the model weights, that contribute least to accuracy on the intended tasks.
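To make the idea of pruning concrete, here is a minimal sketch on a toy two-layer network. It is not Nvidia's procedure: the importance score (the L2 norm of a neuron's incoming weights), the function names, and the tiny weight matrices are all assumptions chosen for illustration.

```python
# Toy structured pruning: drop the hidden neurons that matter least.
# Assumption: importance is approximated by the L2 norm of a neuron's
# incoming weights. Real pipelines typically use richer importance measures.
import math


def neuron_importance(incoming_weights):
    """Score one hidden neuron by the L2 norm of its incoming weights."""
    return math.sqrt(sum(w * w for w in incoming_weights))


def prune_hidden_neurons(w_in, w_out, keep):
    """Keep the `keep` highest-scoring hidden neurons of a two-layer network.

    w_in:  one row per hidden neuron (weights from the inputs).
    w_out: one row per output unit, one column per hidden neuron.
    """
    scores = [neuron_importance(row) for row in w_in]
    ranked = sorted(range(len(w_in)), key=lambda i: scores[i], reverse=True)
    kept = sorted(ranked[:keep])  # preserve the original neuron order
    pruned_in = [w_in[i] for i in kept]
    pruned_out = [[row[i] for i in kept] for row in w_out]
    return pruned_in, pruned_out, kept


# Hypothetical weights: neurons 1 and 3 contribute almost nothing.
w_in = [[0.9, -0.8], [0.01, 0.02], [0.5, 0.4], [-0.03, 0.01]]
w_out = [[1.0, 0.1, -0.7, 0.2]]

pruned_in, pruned_out, kept = prune_hidden_neurons(w_in, w_out, keep=2)
print("kept neurons:", kept)
print("pruned output weights:", pruned_out)
```

The network shrinks from four hidden neurons to two, and both weight matrices are cut consistently so the smaller network still computes a valid output. Accuracy usually drops somewhat after a step like this, which is why pruning is normally followed by further training.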
Further training on a specific dataset
The next step involved ‘distilling,’ in which the pruned model is retrained on a small, specific dataset to recover the accuracy lost during pruning. According to Nvidia, this approach is cheaper than training an entirely new small language model and yields higher accuracy on the tasks at hand.
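In a common formulation of distillation, the smaller "student" model is trained to match the output distribution of the larger "teacher" model. The article does not say which loss Nvidia used, so the sketch below is only an assumed, textbook-style version: a KL divergence between temperature-softened probabilities, with made-up logits and hypothetical function names.

```python
# Toy knowledge-distillation loss: how far is the student's predicted
# distribution from the teacher's? Assumption: KL divergence over
# temperature-softened softmax outputs, a common textbook choice.
import math


def softmax(logits, temperature=1.0):
    """Convert logits to probabilities; higher temperature flattens them."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    """KL(teacher || student) on softened distributions; 0 means a perfect match."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


# Hypothetical next-token logits over a three-word vocabulary.
teacher = [4.0, 1.0, 0.5]
close_student = [3.8, 1.1, 0.4]  # nearly agrees with the teacher
far_student = [0.5, 4.0, 1.0]    # prefers a different token

print("close student loss:", distillation_loss(teacher, close_student))
print("far student loss:", distillation_loss(teacher, far_student))
```

During training, the student's weights would be adjusted to drive this loss down, so a student that agrees with the teacher is penalized far less than one that disagrees.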
The model is available on Hugging Face under an open-source license. It is also offered as an Nvidia NIM microservice with an associated API. A downloadable version of the microservice that can run on any system with a sufficiently powerful GPU is still to come.
Microsoft is also experimenting with models that use hardware efficiently. Yesterday it announced three new variants in the Phi-3.5 line. Among them is the first model in the line to use a Mixture of Experts architecture.