
DeepSeek-V3 overcomes challenges of Mixture of Experts technique

High-quality answers in an energy-efficient way


DeepSeek is releasing the third version of its model as an open-source product. The model contains 671 billion parameters, but it doesn’t activate all of them at once when generating responses.

DeepSeek, a Chinese AI developer, competes with commercial developers through open-source products, and it is regularly successful in doing so. The latest example is DeepSeek-V3, which is available for download via Hugging Face.

The model improves upon its predecessor and surpasses Llama 3.1 405B and Qwen2.5 72B in benchmarks, particularly excelling in coding tasks and mathematical calculations. While it slightly underperforms compared to Anthropic and OpenAI models, it introduces innovative features that will contribute to future LLM development.

Mixture of Experts

DeepSeek-V3 is based on a MoE (Mixture of Experts) architecture, a technique that has already proven its worth for other players. Microsoft, for example, launched its Phi-3.5 models on the same foundation last summer.

The Mixture of Experts technique combines multiple specialized sub-models, called “experts,” each with its own domain expertise. Based on the input query or prompt, a routing mechanism forwards the request to the most suitable expert(s), so that the user gets the best possible result without the entire model having to do the work.
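To make the routing idea concrete, below is a minimal, purely illustrative sketch in Python. The expert count, top-k value and random router are invented for the example and say nothing about DeepSeek’s actual implementation.

```python
# Minimal sketch of Mixture-of-Experts routing (illustrative only; the expert
# count, top-k value and random weights are assumptions for this example).
import numpy as np

rng = np.random.default_rng(0)

NUM_EXPERTS = 8   # hypothetical number of experts
TOP_K = 2         # how many experts each token is routed to
HIDDEN = 16       # hypothetical hidden dimension

# A tiny "router" that scores every expert for a given token representation.
router_weights = rng.normal(size=(HIDDEN, NUM_EXPERTS))

def route(token_vec):
    """Return the indices and mixing weights of the experts chosen for one token."""
    scores = token_vec @ router_weights   # one score per expert
    top = np.argsort(scores)[-TOP_K:]     # keep only the best TOP_K experts
    weights = np.exp(scores[top])
    weights /= weights.sum()              # softmax over the chosen experts
    return top, weights

token = rng.normal(size=HIDDEN)
experts, weights = route(token)
print("Token routed to experts", experts, "with weights", weights.round(2))
# Only the selected experts run their feed-forward computation; the others stay idle.
```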

More energy efficient

This approach enhances efficiency and reduces hardware requirements. Although the complete LLM contains 671 billion parameters, only around 37 billion of them are activated for any given token. This sparse activation makes query processing significantly more energy-efficient.
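A quick back-of-the-envelope calculation, using only the figures mentioned above, shows how small the active share per token is:

```python
# Share of the model that is active for a single token (figures from the text).
total_params = 671e9    # total parameters in DeepSeek-V3
active_params = 37e9    # parameters activated per token
print(f"Active share per token: {active_params / total_params:.1%}")  # roughly 5.5%
```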

The MoE technique also provides training advantages. The model was trained on 14.8 trillion tokens in 2.788 million GPU hours, which is relatively modest compared to other projects that keep tens of thousands of GPUs running for days. This training method also keeps development costs down, an expense that still plagues OpenAI to this day.
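Those two published figures also imply a rough training throughput, which can be derived directly:

```python
# Throughput implied by the published training figures (no further assumptions).
tokens = 14.8e12        # training tokens
gpu_hours = 2.788e6     # GPU hours spent on training
print(f"{tokens / gpu_hours:,.0f} tokens processed per GPU hour")  # about 5.3 million
```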

Also read: OpenAI’s business model isn’t working as bankruptcy looms

Constraint addressed

All this efficiency comes with a downside. Earlier developers ran up against the fact that the workload is distributed unevenly across the various “experts”: some experts receive far more queries than others. That imbalance can hurt the quality of the answers the model then gives.
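The synthetic sketch below illustrates that imbalance: a naive router with a built-in preference sends most tokens to a couple of experts while the others see very little traffic. It demonstrates the problem, not DeepSeek’s remedy.

```python
# Illustration of the MoE load-balancing problem with synthetic numbers:
# a skewed router overloads a few experts and starves the rest.
import numpy as np

rng = np.random.default_rng(1)
NUM_EXPERTS, TOP_K, HIDDEN, NUM_TOKENS = 8, 2, 16, 10_000

router_weights = rng.normal(size=(HIDDEN, NUM_EXPERTS))
router_bias = np.array([2.0, 1.5, 0, 0, 0, 0, 0, 0])   # built-in preference for experts 0 and 1

tokens = rng.normal(size=(NUM_TOKENS, HIDDEN))
scores = tokens @ router_weights + router_bias
chosen = np.argsort(scores, axis=1)[:, -TOP_K:]        # top-k experts per token

counts = np.bincount(chosen.ravel(), minlength=NUM_EXPERTS)
print("Tokens per expert:", counts)
# The overloaded experts become a bottleneck, while rarely chosen experts see
# too little data to specialize well: the quality problem described above.
```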

DeepSeek claims to have developed a method to avoid these problems. This method is called attention and identifies the key elements in a sentence. The technique itself isn’t new, but DeepSeek’s implementation makes multiple passes over the input to capture important details that might be overlooked on a first read.
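For readers unfamiliar with the mechanism, here is textbook scaled dot-product self-attention in a few lines of Python. It shows how a model weighs which parts of a sentence matter for each token; it is the generic version, not DeepSeek’s specific variant or its multi-pass implementation.

```python
# Generic scaled dot-product self-attention (textbook version, for illustration).
import numpy as np

def attention(Q, K, V):
    """Q, K, V: arrays of shape (sequence_length, dim)."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)                   # relevance of every token to every other token
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax: attention weights per token
    return weights @ V                              # weighted mix of the value vectors

rng = np.random.default_rng(2)
x = rng.normal(size=(5, 8))     # 5 tokens with 8-dimensional embeddings
out = attention(x, x, x)        # self-attention over the "sentence"
print(out.shape)                # (5, 8)
```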

Finally, DeepSeek-V3 deploys another trick to enable faster inference: the model generates multiple tokens at a time, whereas other models handle tokens one by one.
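The toy example below shows why producing several tokens per step cuts the number of decoding steps. The “model” is a stand-in that returns pre-set tokens, so only the call counting matters; it is not how DeepSeek-V3 actually predicts multiple tokens.

```python
# Toy contrast: one token per decoding step versus several tokens per step.
TARGET = ["DeepSeek", "-", "V3", " is", " an", " open", "-source", " model"]

def predict_next(prefix):
    return TARGET[len(prefix)]                   # one token per model call

def predict_next_k(prefix, k=2):
    return TARGET[len(prefix):len(prefix) + k]   # k tokens per model call

out, calls = [], 0
while len(out) < len(TARGET):                    # classic decoding: 8 calls
    out.append(predict_next(out))
    calls += 1
print(calls, "calls with single-token decoding")

out, calls = [], 0
while len(out) < len(TARGET):                    # multi-token decoding: 4 calls
    out.extend(predict_next_k(out, k=2))
    calls += 1
print(calls, "calls with multi-token decoding")
```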

Currently, the new version is offered at the same price as DeepSeek-V2. As of Feb. 8, this will change.