Nvidia runs MoE models ten times faster

Nvidia has published new benchmark results showing that its latest AI server platform, the GB200 NVL72, significantly improves the performance of modern mixture-of-experts (MoE) models.

According to the company, recent models, including Moonshot AI’s Kimi K2 Thinking and DeepSeek’s models, run up to ten times faster on the new platform than on previous-generation systems.

Mixture-of-experts models are built on the idea that not every part of a large language model needs to be active at once. The model is split into specialized sub-networks, the experts, and a routing mechanism activates only the most relevant experts for each part of the input. This reduces computing costs while increasing model capacity.
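
To illustrate the idea, here is a minimal, hypothetical sketch of top-k expert routing in Python with NumPy. The dimensions, router weights, and expert matrices are toy placeholders, not any real model's implementation; real MoE layers learn these parameters inside a transformer block.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy sizes chosen for illustration only.
num_experts, top_k, d_model = 8, 2, 16
router_weights = rng.standard_normal((d_model, num_experts))            # learned in practice
experts = [rng.standard_normal((d_model, d_model)) for _ in range(num_experts)]

def moe_layer(token: np.ndarray) -> np.ndarray:
    """Route one token vector to its top-k experts and mix their outputs."""
    logits = token @ router_weights                # router score for each expert
    chosen = np.argsort(logits)[-top_k:]           # indices of the k highest-scoring experts
    weights = np.exp(logits[chosen])
    weights /= weights.sum()                       # softmax over the chosen experts only
    # Only the selected experts run, so compute scales with top_k, not num_experts.
    return sum(w * (token @ experts[i]) for w, i in zip(weights, chosen))

token = rng.standard_normal(d_model)
print(moe_layer(token).shape)                      # (16,)
```

The key property the sketch shows is that the cost per token depends on the number of activated experts (top_k), while the total parameter count grows with the number of experts.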

Mixture-of-experts models are rapidly gaining ground

The approach gained widespread attention after DeepSeek demonstrated in early 2025 that an efficiently designed MoE model could compete with models that required much more GPU time. Since then, OpenAI, Mistral AI, Moonshot AI, and others have incorporated the architecture into their latest-generation models.

Nvidia attributes the performance gains of the GB200 NVL72 to the system’s scale, with 72 GPUs linked within a single node, and to improved NVLink connections between those chips. This should enable more efficient routing between active experts and better parallel execution than on previous generations of servers.
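
A simplified, hypothetical sketch of why that matters: in expert-parallel serving, experts are spread across devices, and each token's work is sent only to the devices hosting its selected experts. The device IDs and round-robin placement below are illustrative assumptions, not Nvidia's implementation.

```python
# Experts are placed on devices round-robin (an illustrative assumption).
num_experts, num_devices = 8, 4
expert_to_device = {e: e % num_devices for e in range(num_experts)}

def dispatch(token_id: int, selected_experts: list[int]) -> dict[int, list[tuple[int, int]]]:
    """Group (token, expert) work items by the device that owns each expert."""
    per_device: dict[int, list[tuple[int, int]]] = {}
    for e in selected_experts:
        per_device.setdefault(expert_to_device[e], []).append((token_id, e))
    return per_device

# Token 0 was routed to experts 1 and 6: its work lands on devices 1 and 2,
# which can run in parallel; a fast interconnect then gathers the results.
print(dispatch(0, [1, 6]))   # {1: [(0, 1)], 2: [(0, 6)]}
```

Because every token may touch experts on different devices, the speed of the links between GPUs limits how quickly these scatter-and-gather steps complete, which is why tighter NVLink coupling across all 72 GPUs helps MoE inference in particular.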

Nvidia’s announcement cites several models to illustrate these technical gains, including Chinese models from Moonshot AI and DeepSeek. Nvidia offers no specific interpretation or geographical context, presenting the results simply as examples of workloads that benefit from the new server architecture. The Reuters report places this in the broader context of international AI developments, noting that Chinese models are becoming increasingly visible and are regularly used to benchmark new hardware.

Nvidia has a strong position in training

The announcement comes at a time when the sector’s focus is shifting from training to large-scale deployment of models for end users. Nvidia has traditionally held a strong position in training, but faces more competition in inference from AMD and Cerebras, among others, which are also working on systems that integrate multiple powerful chips into a single platform. These systems are expected to hit the market next year.

The new figures show that a variety of frontier models can scale well on the GB200 platform. At the same time, companies in China are developing their own AI hardware or moving training work to foreign data centers where advanced chips are available. According to Nvidia, the performance of MoE models on the NVL72 clearly shows that the new server architecture is suitable for models across different generations and origins.