
Breakthroughs in scaling now allow even smaller developers to work with large deep learning models. Microsoft's new DeepSpeed tool is a breakthrough in scaling AI.

Microsoft has released a new version of its open-source DeepSpeed tool that it says will enable the creation of “extremely large” deep learning models with a trillion parameters, more than five times as many as the largest model currently in use. The new tool will also help developers working on smaller projects, according to the company. 

DeepSpeed is a software library for performing artificial intelligence training. The AI tool has already gone through multiple iterations since its release in February. The latest version has effectively increased the maximum size of the models it can train from 100 billion to more than a trillion parameters.
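
For readers who haven't used the library, the sketch below shows how a DeepSpeed training loop is typically wired up. The model, dataset, and config values are placeholders rather than anything from Microsoft's post, the exact argument and config key names can differ between DeepSpeed releases, and a real run is normally started with the deepspeed launcher on one or more GPUs.

```python
import torch
import deepspeed

# Placeholder model and data; any PyTorch module and dataset would work here.
model = torch.nn.Linear(1024, 1024)
dataset = [(torch.randn(1024), torch.randn(1024)) for _ in range(256)]

# Illustrative config; these values are assumptions, not Microsoft's settings.
ds_config = {
    "train_batch_size": 16,
    "optimizer": {"type": "Adam", "params": {"lr": 1e-4}},
}

# deepspeed.initialize wraps the model in an engine that manages distributed
# setup, optimizer state, and (optionally) mixed precision.
model_engine, optimizer, loader, _ = deepspeed.initialize(
    model=model,
    model_parameters=model.parameters(),
    training_data=dataset,
    config=ds_config,
)

for x, y in loader:
    x, y = x.to(model_engine.device), y.to(model_engine.device)
    loss = torch.nn.functional.mse_loss(model_engine(x), y)
    model_engine.backward(loss)  # engine-managed backward pass
    model_engine.step()          # engine-managed optimizer step
```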

“Democratizing” AI development

Microsoft says DeepSpeed can now train a trillion-parameter language model using 100 of Nvidia Corp.'s previous-generation V100 graphics cards. Previously, such a task would have required 4,000 of Nvidia's current-generation (and more expensive) A100 graphics cards running for 100 days to complete.

These cost reductions will give a boost to AI projects across many sectors. Groups such as OpenAI that are working to push the envelope on the size of neural networks could use the tool to reduce the hardware costs associated with their work. Startups and smaller enterprises working on day-to-day applications of AI could also use DeepSpeed to build more powerful models with more parameters than their limited budgets would otherwise allow.

DeepSpeed “democratizes multi-billion-parameter model training and opens the window for many deep learning practitioners to explore bigger and better models,” Microsoft executives Rangan Majumder and Junhua Wang wrote in a blog post.

New technologies drive cost efficiencies

The giant leap in DeepSpeed’s efficiency comes from new technologies. One, called “ZeRO-Offload,” increases the number of parameters a training server can handle by offloading optimizer state and part of the computation to the server’s CPU memory and host processors. Another innovation, called “3D parallelism,” combines data, pipeline and model parallelism to distribute work among the training servers in a way that increases hardware efficiency.
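
Microsoft's post doesn't include configuration snippets, but ZeRO optimizations, including CPU offload, are switched on through DeepSpeed's JSON config. The fragment below is a rough sketch: the key names have changed across releases (older builds used a plain `cpu_offload` flag), so check the docs for the exact schema of the version you are running.

```python
# Illustrative DeepSpeed config enabling ZeRO with optimizer-state offload to
# host memory. Values and key names are a sketch, not canonical settings.
ds_config = {
    "train_batch_size": 16,
    "fp16": {"enabled": True},
    "zero_optimization": {
        "stage": 2,                              # partition optimizer state and gradients
        "offload_optimizer": {"device": "cpu"},  # ZeRO-Offload: keep optimizer state in CPU RAM
    },
}
```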

“3D parallelism adapts to the varying needs of workload requirements to power extremely large models with over a trillion parameters while achieving near-perfect memory-scaling and throughput-scaling efficiency,” Microsoft’s Majumder and Wang wrote. “In addition, its improved communication efficiency allows users to train multi-billion-parameter models 2–7x faster on regular clusters with limited network bandwidth.”
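
The "3D" in 3D parallelism refers to combining data parallelism, pipeline parallelism and model parallelism. Only the pipeline axis is sketched below, using DeepSpeed's PipelineModule with placeholder layers and an illustrative stage count; it assumes a multi-GPU launch whose world size is a multiple of the stage count, and it is not a full 3D setup.

```python
import torch
import deepspeed
from deepspeed.pipe import PipelineModule

# Placeholder layer stack; a real model would list its transformer blocks here.
layers = [torch.nn.Linear(1024, 1024) for _ in range(8)]

# PipelineModule splits the layer list across pipeline stages; replicating the
# pipeline gives data parallelism, and splitting work inside each layer gives
# model parallelism -- together, the three axes of 3D parallelism.
net = PipelineModule(
    layers=layers,
    num_stages=2,              # illustrative: two pipeline stages
    loss_fn=torch.nn.MSELoss(),
)

engine, _, _, _ = deepspeed.initialize(
    model=net,
    model_parameters=net.parameters(),
    config={
        "train_batch_size": 16,
        "optimizer": {"type": "Adam", "params": {"lr": 1e-4}},
    },
)

# train_batch() schedules micro-batches through the pipeline stages;
# my_data_iterator is a hypothetical iterator of (input, label) pairs.
# loss = engine.train_batch(data_iter=my_data_iterator)
```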