DeepSeek Technical Analysis — (5) FP8 Training

Jinpeng Zhang
Feb 2, 2025


Background

This is the 5th blog of my DeepSeek model technical analysis series; for the full background please refer to the 1st blog of this series, “DeepSeek Technical Analysis — (1) MoE”. For those who want to skip this blog and jump to a topic of interest in this DeepSeek series, here is the blog list:

  • Mixture-of-Experts which reduced the training cost and improved the inference efficiency.
  • Multi-Head Latent Attention which reduced the KV cache for the attention part.
  • Multi-Token Prediction which improved the performance (accuracy) of the model.
  • DualPipe which improved the computation-to-communication ratio and efficiency of the large scale GPUs cluster.
  • FP8 Training which reduced the training cost further through low precision training.
  • DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning.

In this blog, I'll focus on the FP8 training adopted by DeepSeek (starting from version V3), which further reduced the training cost.

FP8-LM: Training FP8 Large Language Models

LLMs have demonstrated unprecedented capabilities in language comprehension and generation, leading to breakthroughs in reasoning, math, science, and many other tasks. However, training LLMs is extremely costly. For example, GPT-3 175B consumed several thousand PetaFLOP/s-days of compute for pre-training. This motivates the need to reduce the training costs of LLMs, especially for scaling next-generation super-intelligent models. Low-precision training is one of the most promising directions to reduce these costs, as it can provide high speed, a small memory footprint, and low communication overhead.

FP8-LM: Training FP8 Large Language Models (Microsoft, 2023) proposed an extremely optimized FP8 mixed-precision framework for LLM training. The core idea is to infiltrate FP8 compute, storage, and communication into the whole process of large model training, making both the forward and backward passes use low-precision FP8 and thus largely reducing system workloads compared to previous frameworks.

Training LLMs with FP8 is non-trivial. The challenges stem from issues such as data underflow or overflow, coupled with quantization errors arising from the narrower dynamic range and reduced precision inherent in FP8 data formats. These challenges cause numerical instabilities and irreversible divergences throughout the training process. To tackle them, FP8-LM proposed two techniques for preventing the loss of critical information: precision decoupling and automatic scaling. The former decouples the influence of data precision on parameters such as weights, gradients, and optimizer states, assigning reduced precision to components that are not precision sensitive. The latter preserves gradient values within the representable range of FP8 data formats through dynamic adjustment of tensor scaling factors, thereby alleviating underflow and overflow during all-reduce communication.
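
To make the automatic scaling idea concrete, here is a minimal PyTorch sketch of how a per-tensor scaling factor might be adjusted before gradients are cast to FP8 for all-reduce. This is not FP8-LM's exact rule; the FP8 E4M3 maximum is a real constant, but the overflow budget, the halve/double updates, and the function name are illustrative assumptions.

```python
import torch

FP8_E4M3_MAX = 448.0  # largest magnitude representable in FP8 E4M3

def auto_scale_for_allreduce(grad: torch.Tensor, scale: float,
                             overflow_budget: float = 1e-3):
    """Illustrative sketch of automatic scaling (not FP8-LM's exact algorithm):
    shrink the scale when too many elements would overflow FP8,
    grow it when the tensor uses only a small part of the range."""
    scaled = grad * scale
    overflow_ratio = (scaled.abs() > FP8_E4M3_MAX).float().mean().item()
    if overflow_ratio > overflow_budget:
        scale *= 0.5                      # back off to avoid overflow
    elif scaled.abs().max().item() < FP8_E4M3_MAX / 2:
        scale *= 2.0                      # headroom left, so reduce underflow
    # The rescaled gradient would now be cast to FP8 for all-reduce;
    # the scale factor travels with it so values can be de-scaled afterwards.
    fp8_ready = (grad * scale).clamp(-FP8_E4M3_MAX, FP8_E4M3_MAX)
    return fp8_ready, scale
```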

The paper applied FP8-LM to training GPT-7B, GPT-13B, and GPT-175B. The results demonstrate the effectiveness of this FP8 methodology, yielding substantial benefits including a 29% to 39% reduction in real memory usage (29% for GPT-7B and 39% for GPT-175B) and a notable 63% to 65% decrease in weight-related communication overhead compared to the prevalent BF16 mixed-precision training approach. Notably, during the training of the GPT-175B model, the FP8 mixed-precision framework reduced training time by 37% compared to TE (NVIDIA, 2022b) while consuming 42% less memory on the H100 GPU platform. Meanwhile, FP8-LM does not compromise accuracy (refer to the training loss comparison diagram in the paper).

FP8 Training in DeepSeek

Inspired by existing works like FP8-LM, the DeepSeek team proposed a fine-grained mixed-precision framework utilizing the FP8 data format for training DeepSeek-V3. In this mixed-precision framework, most compute-intensive operations are conducted in FP8, while a few key operations are strategically maintained in their original data formats to balance training efficiency and numerical stability.

During DeepSeek's FP8 training, the majority of core computation kernels, i.e., GEMM (General Matrix Multiplication) operations, are implemented in FP8 precision. These GEMM operations accept FP8 tensors as inputs and produce outputs in BF16 or FP32. As depicted in the mixed-precision framework diagram of the DeepSeek-V3 report, all three GEMMs associated with the Linear operator, namely Fprop (forward pass), Dgrad (activation backward pass), and Wgrad (weight backward pass), are executed in FP8. Despite the efficiency advantage of the FP8 format, certain operators still require higher precision due to their sensitivity to low-precision computation. In addition, some low-cost operators can also use higher precision with negligible overhead to the overall training cost. For this reason, after careful investigation, DeepSeek maintains the original precision (e.g., BF16 or FP32) for the following components: the embedding module, the output head, MoE gating modules, normalization operators, and attention operators.
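
Below is a toy PyTorch sketch of this split, assuming a PyTorch build that ships the torch.float8_e4m3fn dtype: the Linear-style GEMM path round-trips its operands through FP8, while normalization stays in BF16. It only illustrates the precision policy, not DeepSeek's kernels; real FP8 GEMMs run on Tensor Cores with per-tile scaling (covered next), and the scaling is omitted here for brevity.

```python
import torch
import torch.nn as nn

def fp8_gemm_sim(x: torch.Tensor, w: torch.Tensor) -> torch.Tensor:
    """Simulated FP8 GEMM: operands are quantized to FP8 E4M3, output is BF16.
    (Per-tile scaling is omitted; a real kernel would apply it.)"""
    x_fp8 = x.to(torch.float8_e4m3fn)
    w_fp8 = w.to(torch.float8_e4m3fn)
    # A real FP8 kernel multiplies in FP8 on Tensor Cores; here we upcast to
    # BF16 just to perform the matmul in plain PyTorch.
    return x_fp8.to(torch.bfloat16) @ w_fp8.to(torch.bfloat16).t()

class MixedPrecisionBlock(nn.Module):
    """Toy block mirroring the policy: the projection goes through the FP8 GEMM,
    while normalization (and, in DeepSeek-V3, embedding, output head, MoE gating,
    and attention) keeps its original BF16/FP32 precision."""
    def __init__(self, dim: int):
        super().__init__()
        self.norm = nn.LayerNorm(dim, dtype=torch.bfloat16)           # high precision
        self.w = nn.Parameter(torch.randn(dim, dim, dtype=torch.bfloat16))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = self.norm(x.to(torch.bfloat16))   # kept in BF16
        return fp8_gemm_sim(h, self.w)        # the Fprop GEMM runs "in FP8"
```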

As mentioned in the FP8-LM section, overflows and underflows are common challenges in low-precision training frameworks due to the limited dynamic range of the FP8 format, which is constrained by its reduced exponent bits. To address the sensitivity to outliers that coarser scaling methods suffer from, DeepSeek proposed a fine-grained quantization method that applies scaling at a more granular level. For activations, DeepSeek groups and scales elements on a 1x128 tile basis (i.e., per token per 128 channels); for weights, DeepSeek groups and scales elements on a 128x128 block basis (i.e., per 128 input channels per 128 output channels). The tile-wise grouping achieves better precision, while the block-wise grouping is more efficient.
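
The sketch below (plain PyTorch, assuming the torch.float8_e4m3fn dtype and dimensions divisible by 128) shows what the two granularities mean in practice: one scale per 1x128 activation tile and one scale per 128x128 weight block. The helper names and layouts are illustrative, not DeepSeek's kernel code.

```python
import torch

FP8_E4M3_MAX = 448.0  # largest magnitude representable in FP8 E4M3

def quantize_activation_tilewise(x: torch.Tensor, tile: int = 128):
    """Activations: one scale per 1x128 tile (per token, per 128 channels).
    x has shape [tokens, channels] with channels divisible by `tile`."""
    t, c = x.shape
    x_tiles = x.view(t, c // tile, tile)
    scales = x_tiles.abs().amax(dim=-1, keepdim=True).clamp_min(1e-12) / FP8_E4M3_MAX
    x_q = (x_tiles / scales).to(torch.float8_e4m3fn)
    return x_q.view(t, c), scales.squeeze(-1)          # FP8 data + per-tile scales

def quantize_weight_blockwise(w: torch.Tensor, block: int = 128):
    """Weights: one scale per 128x128 block.
    w has shape [out, in] with both dimensions divisible by `block`."""
    o, i = w.shape
    w_blocks = w.view(o // block, block, i // block, block)
    scales = w_blocks.abs().amax(dim=(1, 3), keepdim=True).clamp_min(1e-12) / FP8_E4M3_MAX
    w_q = (w_blocks / scales).to(torch.float8_e4m3fn)
    return w_q.view(o, i), scales.squeeze(1).squeeze(-1)  # FP8 data + block scales
```

Keeping a separate scale per small tile or block means a single outlier only degrades the quantization of its own 128-element group rather than an entire tensor, which is exactly the sensitivity-to-outliers problem the fine-grained scheme targets.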

Accumulation precision refers to the bit precision used when summing multiple low-precision values in operations like General Matrix Multiplication (GEMM). When performing matrix multiplications, thousands or millions of numbers are multiplied and summed together; if the accumulation is done in a low-precision format, errors build up and lead to inaccurate results. Because the Tensor Cores' accumulation for FP8 GEMMs is itself limited in precision, DeepSeek introduced an "increasing accumulation precision" strategy that periodically promotes partial results to FP32 (a conceptual sketch follows the list below). The basic process is:

  1. Tensor Cores in modern NVIDIA GPUs that support FP8 (e.g., H100/H800) perform Matrix Multiply-Accumulate (MMA) operations at low precision (FP8) to improve speed and efficiency.
  2. Accumulation inside the Tensor Cores is limited to roughly 14-bit precision (instead of full FP32).
  3. Once an interval of N_C accumulations is reached, these partial results are copied to FP32 registers on the CUDA Cores, where full-precision FP32 accumulation is performed.
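
Below is a conceptual simulation of this promotion scheme, not the actual Tensor Core / CUDA Core interplay inside the GEMM kernel: partial sums over chunks of products are kept in a lower precision (BF16 stands in for the limited Tensor Core accumulator) and are added into an FP32 accumulator once per chunk. The chunk size of 128 follows the N_C interval described in the DeepSeek-V3 report.

```python
import torch

def dot_with_promotion(a: torch.Tensor, b: torch.Tensor, n_c: int = 128) -> torch.Tensor:
    """Conceptual sketch of "increasing accumulation precision" for one dot product:
    low-precision partial sums are promoted to an FP32 accumulator every n_c elements."""
    acc_fp32 = torch.zeros((), dtype=torch.float32)
    for start in range(0, a.numel(), n_c):
        a_chunk = a[start:start + n_c].to(torch.bfloat16)
        b_chunk = b[start:start + n_c].to(torch.bfloat16)
        partial = (a_chunk * b_chunk).sum(dtype=torch.bfloat16)  # limited-precision partial sum
        acc_fp32 += partial.to(torch.float32)                    # periodic promotion to FP32
    return acc_fp32
```

Without the periodic promotion, the entire reduction over a long inner dimension would have to live in the limited accumulator, which is where the error build-up comes from.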

Experiments showed that, compared with the BF16 baseline, the relative loss error of the FP8-trained DeepSeek model remains consistently below 0.25%, a level well within the acceptable range of training randomness.

My Comments

The FP8-LM framework uses statistical analysis to automatically adjust the tensor scaling factors so that gradient values stay within the representable range of FP8 data formats, while DeepSeek's FP8 training adopts tile-wise and block-wise grouping for its fine-grained quantization. Both are practical, solid engineering optimizations.
