DeepSeek Technical Analysis — (4) DualPipe
Background
This is the 4th blog of my DeepSeek technical analysis series. For the overall background, please refer to the 1st blog of this series, “DeepSeek Technical Analysis — (1) MoE”. For those who want to skip this blog and jump to the topic of this DeepSeek series that interests them, here is the blog list:
- Mixture-of-Experts, which reduced the training cost and improved the inference efficiency.
- Multi-Head Latent Attention, which reduced the KV cache for the attention part.
- Multi-Token Prediction, which improved the performance (accuracy) of the model.
- DualPipe, which improved the computation-to-communication ratio and the efficiency of the large-scale GPU cluster.
- FP8 Training, which further reduced the training cost through low-precision training.
- DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning.
In this blog I’ll focus on DualPipe, which lets DeepSeek improve the computation-to-communication ratio and the efficiency of the GPU cluster during training.
Forward and Backward Computation During Training
In deep learning, forward and backward propagation are the two main steps in training a neural network. In the forward stage, the input data is passed through the network layer by layer; each layer applies its transformation and produces an output. In the backward stage, the loss is calculated using a loss function (e.g., cross entropy, MSE), back-propagation computes the gradient of the loss with respect to the weights, and the gradients are propagated backward, layer by layer, using the chain rule of differentiation. The weights are then updated using an optimization algorithm (e.g., Adam).
There are a few things we should know before we go to the next section (the code sketch after this list illustrates them):
- For one input sample, the backward computation happens after the forward computation.
- For one input sample, in the forward stage layerN handles it first and then passes it to layerN+1; in the backward stage, layerN+1 handles it first and then passes it to layerN.
- Usually we feed a batch of samples to the neural network and then update the weights of the network once per batch.
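Here is a minimal PyTorch sketch of one training step that shows these points in order; the layer sizes, loss function, and optimizer are arbitrary choices for illustration only.

```python
import torch
import torch.nn as nn

# A tiny 2-layer network; sizes are arbitrary, just for illustration.
model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 4))
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

# One batch of samples and labels.
x = torch.randn(8, 16)                 # batch of 8 samples
target = torch.randint(0, 4, (8,))

# Forward: the input flows layer by layer and produces the output.
logits = model(x)

# Loss, then backward: gradients flow from the last layer back to the first.
loss = loss_fn(logits, target)
loss.backward()

# Optimizer step: the weights are updated once per batch.
optimizer.step()
optimizer.zero_grad()
```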
Challenges of Large Scale Distributed Training
When the model is small and fits into a single GPU’s memory, training is quite simple and efficient: feed samples, run the forward computation, run the backward computation, and so on; the GPU’s compute resources can be well utilized. To accelerate the training of such small- to medium-size models with more data, techniques like Data Parallelism (e.g., PyTorch Distributed — 2020 Meta AI) replicate the whole model across multiple GPUs, generate gradients independently on each replica, and then communicate those gradients at each iteration to keep the model replicas consistent.
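As a hedged sketch of what this looks like in practice, here is a minimal data-parallel training step using PyTorch’s DistributedDataParallel. It assumes the script is launched with torchrun (which provides the rank and world-size environment variables) and that each process has a GPU; the model and data are placeholders.

```python
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP

# Assumes launch via torchrun, which sets the RANK/WORLD_SIZE/MASTER_ADDR env vars.
dist.init_process_group(backend="nccl")
local_rank = dist.get_rank() % torch.cuda.device_count()
torch.cuda.set_device(local_rank)

# Each process holds a full replica of the (toy) model.
model = nn.Linear(16, 4).cuda(local_rank)
ddp_model = DDP(model, device_ids=[local_rank])
optimizer = torch.optim.Adam(ddp_model.parameters(), lr=1e-3)

# Each replica computes gradients on its own data shard independently;
# DDP all-reduces (averages) the gradients during backward() to keep replicas consistent.
x = torch.randn(8, 16).cuda(local_rank)
loss = ddp_model(x).sum()
loss.backward()        # gradient all-reduce happens here
optimizer.step()
```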
When it comes to a large model, like an LLM with hundreds of billions of parameters, the model is too large to fit into a single GPU, so we need to partition the model and distribute it across a large cluster with hundreds or thousands of GPUs (e.g., the GPT-3 model with 175B parameters was reportedly trained on a cluster of around 10,000 GPUs). This technique is called Model Parallelism. We can partition the model by layers (GPT-3 175B has around 96 layers) and distribute different layers to different GPUs. If a single layer is still too large to fit into one GPU’s memory (besides the parameters, the optimizer states and gradients also need memory during training), that layer can be further partitioned into several parts, with each part assigned to a dedicated GPU. Remember the Mixture-of-Experts I introduced in the 1st blog? Mixture-of-Experts already splits the feed-forward layer into independent experts, so we can distribute these experts across different GPUs; each attention layer has multiple heads (GPT-3 175B has 96 heads), so we can distribute these heads (and their MatMul computations) across different GPUs. And we can partition the model even further at the tensor level.
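To make layer-wise model parallelism concrete, here is a naive two-GPU sketch (not how production frameworks or DeepSeek actually implement it): the first half of the layers lives on one device, the second half on another, and activations and gradients cross the device boundary. It assumes two CUDA devices are available, and the layer sizes are arbitrary.

```python
import torch
import torch.nn as nn

class TwoStageModel(nn.Module):
    """Naive layer-wise model parallelism across two GPUs."""
    def __init__(self):
        super().__init__()
        self.stage0 = nn.Sequential(nn.Linear(16, 64), nn.ReLU()).to("cuda:0")
        self.stage1 = nn.Sequential(nn.Linear(64, 4)).to("cuda:1")

    def forward(self, x):
        h = self.stage0(x.to("cuda:0"))
        # Activations must be transferred across devices between stages.
        return self.stage1(h.to("cuda:1"))

model = TwoStageModel()
out = model(torch.randn(8, 16))
out.sum().backward()   # gradients flow back across the device boundary as well
```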
Data Parallelism and Model Parallelism are not mutually exclusive; they can be used together to accelerate large-scale model training. After applying Model Parallelism by partitioning the model, each partition can be replicated across multiple GPUs and trained with Data Parallelism, as the rank-mapping sketch below shows.
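A small, purely illustrative sketch of how the two can be combined: the 8-rank cluster, 4-way pipeline split, and 2-way replication are arbitrary numbers chosen for the example, not any particular system’s configuration.

```python
# Hypothetical mapping of 8 ranks to 4-way model (pipeline) parallelism
# combined with 2-way data parallelism.
world_size = 8
pp_size = 4                      # number of model partitions (pipeline stages)
dp_size = world_size // pp_size  # number of replicas of each partition

for rank in range(world_size):
    pp_stage = rank % pp_size    # which model partition this rank holds
    dp_replica = rank // pp_size # which data-parallel replica it belongs to
    print(f"rank {rank}: pipeline stage {pp_stage}, data-parallel replica {dp_replica}")
```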
Everything looks good so far? Let’s talk about the challenges of large-scale distributed training. After the model and its computations are distributed across hundreds or thousands of GPUs, fully utilizing these large-scale computational resources becomes a big challenge. (There are other challenges too: with thousands of nodes in a single cluster, the probability of a node crashing during training is much higher than in a small cluster, so we need to recover from checkpoints, for example by adding a new GPU/node from an idle HA pool back to the cluster or by re-balancing the model and training tasks across the topology. In this blog, we focus on how to fully utilize the resources.) As described in the last section, there are dependencies between the forward and backward computations and between the layers’ computations, and the model has been partitioned across different nodes, so data transfer and communication across nodes are required. Some nodes may sit idle because they depend on another node’s computational result that is still in progress; gradients, layer outputs, and other data may need to be transferred from one node to another, which can hit the network bandwidth bottleneck and leave some nodes idle; and some steps (e.g., the optimizer step) need synchronization, which can also leave nodes idle.
Zero Bubble Pipeline Parallelism
In order to fully leverage the computational resources of the cluster during deep neural network training, several Pipeline Parallelism techniques were introduced. PipeDream: Fast and Efficient Pipeline Parallel DNN Training — 2018 (Microsoft, CMU, Stanford) introduced the one-forward-one-backward (1F1B) schedule, which improves the GPU utilization of a cluster by pipelining to overlap communication and computation. The numbers 1, 2, 3, 4 in the following diagram indicate different mini-batches of training data.
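To make the 1F1B idea concrete, here is a small scheduling sketch (not PipeDream’s actual implementation; it ignores timing and cross-stage communication and only shows the per-stage ordering of work). The function name and the 4-stage, 6-micro-batch setup are made up for the example.

```python
def one_f_one_b(num_stages: int, num_microbatches: int, stage: int):
    """Order of forward (F) / backward (B) micro-batch work on one pipeline stage."""
    # Warmup: later stages need fewer forwards queued up before steady state.
    warmup = min(num_stages - 1 - stage, num_microbatches)
    ops = [("F", i) for i in range(warmup)]
    fwd, bwd = warmup, 0
    # Steady state: one forward immediately followed by one backward.
    while fwd < num_microbatches:
        ops.append(("F", fwd)); fwd += 1
        ops.append(("B", bwd)); bwd += 1
    # Cooldown: drain the remaining backwards.
    while bwd < num_microbatches:
        ops.append(("B", bwd)); bwd += 1
    return ops

for s in range(4):
    print(f"stage {s}:", " ".join(f"{op}{i}" for op, i in one_f_one_b(4, 6, s)))
```

For the last stage this prints F0 B0 F1 B1 …, i.e., each forward is immediately followed by its backward, while earlier stages run a few warmup forwards first.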
Zero Bubble Pipeline Parallelism — 2023 (Sea AI) pointed out that the backward computation actually contains two parts: computing the gradient with respect to the input x (B) and with respect to the layer’s parameters W (W). The 1F1B strategy lumps B and W together into a single backward pass, which unnecessarily lengthens the chain of sequentially dependent computations. So the Zero Bubble pipeline splits B and W into different stages to reduce the bubbles in the pipeline, and it replaces the beforehand synchronization with a post-update validation to further reduce bubbles around the optimizer step (bottom part of the following diagram).
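The split itself is easy to see on a toy example. The sketch below uses a single matrix multiply and torch.autograd.grad to compute the two gradients separately; it is only an illustration of the B/W decomposition, not Zero Bubble’s actual scheduling code.

```python
import torch

# Toy single "layer": y = x @ w.
x = torch.randn(4, 8, requires_grad=True)   # activation coming from the previous stage
w = torch.randn(8, 8, requires_grad=True)   # this stage's parameters
loss = (x @ w).sum()

# B: gradient w.r.t. the input. The previous pipeline stage is waiting for this,
# so it sits on the critical path and should be computed first.
grad_x, = torch.autograd.grad(loss, x, retain_graph=True)

# W: gradient w.r.t. the weights. Nothing downstream depends on it immediately,
# so it can be deferred and scheduled into what would otherwise be pipeline bubbles.
grad_w, = torch.autograd.grad(loss, w)
```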
DualPipe in DeepSeek
DeepSeek (starting from V3) introduced the DualPipe schedule, which shares a similar idea with the Zero Bubble pipeline schedule but adds some changes to further improve the computation-to-communication ratio and efficiency:
- Finer-grained stages: divide each chunk into 4 components: attention, all-to-all dispatch (communication between devices), MLP (Multi-Layer Perceptron), and all-to-all combine (merging outputs across devices). For a backward chunk, the attention and MLP are further split into two parts: backward for input (B) and backward for weights (W), as in Zero Bubble.
- Bidirectional pipeline scheduling, which feeds micro-batches from both ends of the pipeline simultaneously so that a significant portion of the communication can be fully overlapped (see the 2 black arrows in the following diagram). To support bidirectional scheduling, DualPipe requires keeping two copies of the model parameters. If we have 8 devices and an 8-layer model, in the Zero Bubble schedule each device holds one corresponding layer; but in the DualPipe schedule, to handle the bidirectional pipeline, device 0 should hold the model’s layer0 and layer7, and device 7 should hold layer7 and layer0 (a small sketch of this mapping follows the list).
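A tiny illustration of the layer placement implied by the 8-device, 8-layer example above. This only shows which layers each device would hold (and hence the parameter duplication); it says nothing about DualPipe’s actual scheduling logic.

```python
num_devices = 8
num_layers = 8   # one layer per device in the unidirectional (Zero Bubble) case

for d in range(num_devices):
    zero_bubble_layers = [d]                   # one copy of one layer
    dualpipe_layers = [d, num_layers - 1 - d]  # that layer plus its mirror from the other direction
    print(f"device {d}: Zero Bubble holds {zero_bubble_layers}, DualPipe holds {dualpipe_layers}")
```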
To ensure sufficient computational performance for DualPipe, DeepSeek also customized efficient cross-node all-to-all communication kernels (for both dispatching and combining) to conserve the SMs dedicated to communication. Please see the DeepSeek-V3 Technical Report for more details.
My Comments
The creative DualPipe schedule, plus their excellent infra-level engineering optimizations, lets DeepSeek fully utilize the computational resources (GPUs) of the cluster. From this part, I can see the smartness and excellent engineering spirit of this team. This is probably driven in part by their limited resources compared to other LLM giants like OpenAI, Meta, Google, etc.