DeepSeek Technical Analysis — (1) Mixture-of-Experts

Jinpeng Zhang
Jan 28, 2025 · 7 min read

--

Background

The DeepSeek LLM is attracting massive attention right now, due to its significant reduction in training cost and improvement in inference efficiency without compromising accuracy. As an engineer curious about new techniques, there are several questions I want to answer:

  • 1) What special techniques contribute to this significant improvement in the DeepSeek model?
  • 2) Which techniques are disruptive innovations, and which are engineering optimizations?
  • 3) Does this kind of improvement still have room to go further?

Before I start the technical analysis, I’d like to express my appreciation for DeepSeek’s open-source spirit in sharing so many technical materials with valuable details. As an employee of an open-source company (tidb.io) for nearly 10 years, I fully understand what this means to the community and to the prosperity of the whole industry.

The curiosity to find these answers led me to read most of DeepSeek’s public materials and the related papers. I list these papers/materials here in ascending date order in case you are interested in them.

I will use a series of blogs to dive deeply, one by one, into the key techniques that contribute to the outstanding efficiency and performance of the DeepSeek model, such as Mixture-of-Experts (MoE), Multi-Head Latent Attention (MLA), Multi-Token Prediction and more:

  • Mixture-of-Experts, which reduces the training cost and improves inference efficiency.
  • Multi-Head Latent Attention, which reduces the KV cache size for the attention part.
  • Multi-Token Prediction, which improves the performance (accuracy) of the model.
  • DualPipe, which improves the computation-to-communication ratio and the efficiency of large-scale GPU clusters.
  • FP8 Training, which further reduces the training cost by using low-precision training.

Basic Architecture of DeepSeek Model

Basic Architecture of DeepSeek (v2&v3&r1) Model

DeepSeek follows the transformer architecture used by GPT, LLaMA and other LLMs (if you are not familiar with the transformer, please refer to my last blog “Transformer Clearly Explained: Attention is All You Need”). There are L stacked identical transformer layers; each layer has a Masked Multi-Head Attention sub-layer and a Feed-Forward Network sub-layer. For the Multi-Head Attention layer, DeepSeek (starting from V2) adopts a low-rank key-value joint compression technique to reduce the KV cache size (please refer to my transformer blog for the concepts of Queries, Keys and Values in the attention mechanism). For the Feed-Forward Network layer, DeepSeek adopts the Mixture-of-Experts (MoE) technique, which enables training strong models at an economical cost through sparse computation.
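To make this structure concrete, here is a minimal, illustrative sketch in PyTorch (not DeepSeek’s actual implementation) of one such transformer layer: a masked self-attention sub-layer followed by an FFN sub-layer, each wrapped in a residual connection. The layer sizes and the norm choice are assumptions for illustration; the FFN slot is exactly where the MoE layer discussed below plugs in.

```python
import torch
import torch.nn as nn

class TransformerLayer(nn.Module):
    """One decoder layer: masked self-attention + FFN, each with a residual.
    The `ffn` module can be a dense FFN or an MoE layer (sketched later)."""
    def __init__(self, d_model: int, n_heads: int, ffn: nn.Module):
        super().__init__()
        self.attn_norm = nn.LayerNorm(d_model)   # illustrative; real models often use RMSNorm
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn_norm = nn.LayerNorm(d_model)
        self.ffn = ffn

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (batch, seq, d_model)
        seq_len = x.size(1)
        # causal mask: position i may only attend to positions <= i
        causal = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool,
                                       device=x.device), diagonal=1)
        h = self.attn_norm(x)
        attn_out, _ = self.attn(h, h, h, attn_mask=causal, need_weights=False)
        x = x + attn_out                      # residual around attention
        x = x + self.ffn(self.ffn_norm(x))    # residual around the FFN / MoE slot
        return x

# usage: plug in a dense FFN here, or an MoE layer as sketched in the next section
layer = TransformerLayer(d_model=64, n_heads=4,
                         ffn=nn.Sequential(nn.Linear(64, 256), nn.GELU(), nn.Linear(256, 64)))
out = layer(torch.randn(2, 16, 64))   # (batch=2, seq=16, d_model=64)
```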

In this blog, I’ll focus on the Mixture-of-Experts(MoE) part.

Mixture-of-Experts(MoE)

Exploiting scale in both training data and model size has been central to the success of deep learning. When datasets are sufficiently large, increasing the capacity (number of parameters) of neural networks can give much better prediction accuracy.

LLM’s Scaling Law

But as the model capacity scales up, the training and inference costs increase accordingly, because as the number of model parameters grows, the number of parameters activated to handle each token during training and inference also grows.

In order to scale the model size while keeping the training and inference costs at a constant level, conditional computation was introduced into machine learning. In Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer (Jan 2017), Google introduced the MoE layer for its language models. The basic idea of MoE is to split the FFN into multiple sub-networks (experts); for each input token, only some of the sub-networks (experts) are activated. Different sub-networks behave as different “experts”: during training they absorb different information and knowledge from the dataset, and during inference only some of the experts are activated based on the input token.

MoE Layer in Neural Networks

There are two key components in the MoE layer: the Gating Network and the Expert Networks. The Gating Network decides which “experts” should be activated for an input token, and those experts then process the token and produce the output for the next layer, both during training and inference. The Gating Network chooses the top K experts to activate for each input token; this is called “TopK Gating”. Both the Gating Network and the Expert Networks are trained by simple back-propagation. The paper also shares how to handle the workload-balancing issue between experts, how to resolve the shrinking batch problem, and solutions to other practical issues.
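As a concrete illustration, here is a minimal PyTorch sketch of such a sparsely-gated MoE layer with TopK gating. It is deliberately simplified: it loops over experts instead of using the batched dispatch real systems use, and it omits the noisy gating and load-balancing loss the paper adds to keep experts evenly utilized. All names and sizes are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Expert(nn.Module):
    """One expert is just an ordinary position-wise FFN sub-network."""
    def __init__(self, d_model: int, d_hidden: int):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(d_model, d_hidden), nn.GELU(),
                                 nn.Linear(d_hidden, d_model))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)

class TopKMoE(nn.Module):
    """Sparsely-gated MoE: each token is processed by only k of n_experts."""
    def __init__(self, d_model: int, d_hidden: int, n_experts: int, k: int):
        super().__init__()
        self.experts = nn.ModuleList([Expert(d_model, d_hidden) for _ in range(n_experts)])
        self.gate = nn.Linear(d_model, n_experts, bias=False)   # the gating network
        self.k = k

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        shape = x.shape
        x = x.reshape(-1, shape[-1])                  # flatten to (n_tokens, d_model)
        scores = self.gate(x)                         # (n_tokens, n_experts)
        topk_scores, topk_idx = scores.topk(self.k, dim=-1)
        weights = F.softmax(topk_scores, dim=-1)      # normalize over the chosen k
        out = torch.zeros_like(x)
        for slot in range(self.k):                    # for each of the k choices...
            idx, w = topk_idx[:, slot], weights[:, slot].unsqueeze(-1)
            for e, expert in enumerate(self.experts): # ...send tokens to their expert
                mask = idx == e
                if mask.any():
                    out[mask] += w[mask] * expert(x[mask])
        return out.reshape(shape)

moe = TopKMoE(d_model=64, d_hidden=256, n_experts=8, k=2)
y = moe(torch.randn(4, 16, 64))   # 4 sequences x 16 tokens; each token uses only 2 of 8 experts
```

Note that although the layer holds all 8 experts’ parameters, each token only pays the compute cost of 2 of them, which is the whole point of sparse computation.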

DeepSeekMoE

DeepSeek has adopted the DeepSeekMoE layer for its FFN part starting from V2. DeepSeekMoE is a variant of MoE with two changes:

  • Finely segmenting the experts into mN ones and activating mK from them, allowing for a more flexible combination of activated experts;
  • Isolating Ks experts as shared ones, aiming at capturing common knowledge and mitigating redundancy among the routed experts.

The paper DeepSeekMoE: Towards Ultimate Expert Specialization in Mixture-of-Experts Language Models (Jan 2024) describes DeepSeekMoE in more detail.

Architecture of DeepSeekMoE

This paper points out that the conventional TopK MoE suffers from Knowledge Hybridity and Knowledge Redundancy. (1) Knowledge Hybridity: existing MoE practices often employ a limited number of experts (e.g., 8 or 16), so the tokens assigned to a specific expert are likely to cover diverse knowledge. Consequently, the designated expert tends to assemble vastly different types of knowledge in its parameters, which are hard to utilize simultaneously. (2) Knowledge Redundancy: tokens assigned to different experts may require common knowledge. As a result, multiple experts may converge in acquiring shared knowledge in their respective parameters, leading to redundancy in the expert parameters. These issues collectively hinder expert specialization in existing MoE practices, preventing them from reaching the theoretical upper-bound performance of MoE models. By finely segmenting into more experts and introducing shared experts, DeepSeekMoE mitigates the above two issues.
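Here is an equally simplified sketch of the DeepSeekMoE idea, reusing the Expert class from the previous example: many small routed experts chosen by TopK gating, plus a small set of shared experts that every token always passes through. The gating details (softmax vs. sigmoid affinities, whether the top-K weights are renormalized) differ across DeepSeek versions, so this only shows the structural idea; all names and sizes are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DeepSeekMoELayer(nn.Module):
    """Fine-grained routed experts + always-on shared experts (structural sketch)."""
    def __init__(self, d_model: int, d_expert: int,
                 n_routed: int, k_routed: int, n_shared: int):
        super().__init__()
        # Fine-grained segmentation: many *small* experts (narrow d_expert),
        # so the top-K choice can combine knowledge more flexibly per token.
        self.routed = nn.ModuleList([Expert(d_model, d_expert) for _ in range(n_routed)])
        # Shared experts: always active, intended to capture common knowledge
        # so that the routed experts can specialize.
        self.shared = nn.ModuleList([Expert(d_model, d_expert) for _ in range(n_shared)])
        self.gate = nn.Linear(d_model, n_routed, bias=False)
        self.k = k_routed

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        shape = x.shape
        x = x.reshape(-1, shape[-1])                      # (n_tokens, d_model)
        out = torch.zeros_like(x)
        for expert in self.shared:                        # shared path: every token
            out = out + expert(x)
        affinity = F.softmax(self.gate(x), dim=-1)        # token-to-expert affinities
        topk_w, topk_idx = affinity.topk(self.k, dim=-1)  # keep only the top-K routed experts
        for slot in range(self.k):
            idx, w = topk_idx[:, slot], topk_w[:, slot].unsqueeze(-1)
            for e, expert in enumerate(self.routed):
                mask = idx == e
                if mask.any():
                    out[mask] += w[mask] * expert(x[mask])
        return out.reshape(shape)
```

For a sense of scale, the DeepSeek-V3 technical report describes each MoE layer as having 1 shared expert and 256 routed experts, with 8 routed experts activated per token.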

The diagram above shows that DeepSeekMoE 16B (the model introduced in the DeepSeekMoE paper) achieves performance comparable to LLaMA2 7B, while LLaMA2 7B has about 2.5 times the activated parameters: LLaMA2 7B is dense, so all of its roughly 7B parameters are activated for every token, whereas DeepSeekMoE 16B activates only a fraction of its parameters per token (about 2.8B of 16.4B, per the paper).

My Comments

Mixture-of-Experts is a great way to let different sub-networks of the LLM absorb and learn knowledge from different domains: each parameter in the model becomes highly specialized for a specific domain, and each activated parameter contributes strongly to the problem/question at hand. This is similar to how our brain works, where different neurons handle different tasks.

Has the DeepSeekMoE architecture pushed MoE’s potential to its limit? The paper says that “at the scale of about 2B parameters and 100B training tokens, the performance of DeepSeekMoE aligns closely with the theoretical upper bound of MoE models.” This conclusion is based on a comparison with a dense model (a dense model means all of the model’s parameters are activated for each input token). I believe there exists an MoE architecture that is more efficient than the current fine-grained segmentation + shared experts MoE, because I think: 1) the dense model is not the real boundary; 2) currently each expert has the same number of parameters, but the capacity and complexity of knowledge in different domains may vary, so using the same number of experts with the same number of parameters to handle such varied questions identically is not the most efficient approach. Our brain uses different parts and different numbers of neurons to handle different types of tasks; as we learn more about how our brain works, we may come up with more optimizations to improve the efficiency of MoE.

Finally, the efficiency improvements of LLMs will definitely promote the prosperity of AI. From the model perspective, more companies and organizations will contribute to the improvement of LLMs. The cost reduction of LLM APIs will benefit applications, and we will see more and more AI applications appear in different industries to help us improve our efficiency and handle knowledge tasks.

Written by Jinpeng Zhang

Director of Engineering @ TiDB, focused on building large-scale distributed systems and high-performance engineering teams.
