DeepSeek Technical Analysis — (2) Multi-Head Latent Attention
Background
This is the 2nd blog of my DeepSeek Model technical analysis series. For the whole background, please refer to the 1st blog of this series, “DeepSeek Technical Analysis — (1) MoE”. For those who want to skip this blog and jump to the topic that interests you, here is the blog list:
- Mixture-of-Experts, which reduces the training cost and improves inference efficiency.
- Multi-Head Latent Attention, which reduces the KV cache for the attention part.
- Multi-Token Prediction, which improves the performance (accuracy) of the model.
- DualPipe, which improves the computation-to-communication ratio and the efficiency of large-scale GPU clusters.
- FP8 Training, which further reduces the training cost through low-precision training.
- DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning.
In the last blog, I focused on the Mixture-of-Experts (MoE) part, which splits the Feed-Forward Network into several sub-networks (experts) so that only a subset of the experts’ parameters is activated for each input token. This reduces the training cost and improves inference efficiency. For example, in the DeepSeek-V3 671B model, only 37B parameters are activated for each token.
In this blog, I’ll focus on the Multi-Head Latent Attention (MLA) part. MLA is a variant of Multi-Head Attention (if you are not familiar with MHA, please refer to my transformer clear explanation blog here). The DeepSeek models have adopted MLA since V2.
Multi-Head Attention
Multi-Head Attention is the key part of the transformer model. The paper “Attention is All You Need — 2017 Google” proposed the attention mechanism to capture global dependencies between tokens in a sequence with parallel computation. This attention mechanism is a disruptive innovation that breaks the sequential constraint of RNN and CNN models, which directly promoted the massive evolution of language models.
In Multi-Head Attention, there is a dedicated projection matrix for each head of the queries, keys and values respectively. For example, there are 8 heads in the original transformer model (in the transformer paper), so there are 8 × 3 (queries, keys and values) = 24 separate projection matrices. Queries, keys and values are each projected into h different replicas with their own projection matrices. In the 8-head transformer model above, the keys have 8 projected replicas, the values have 8 projected replicas, and the same holds for the queries.
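To make the per-head projections concrete, here is a minimal PyTorch-style sketch. The dimensions and names are my own illustration (not the original transformer code); a single d_model × d_model linear per Q/K/V is just a packed form of the 8 per-head matrices.

```python
import torch
import torch.nn as nn

class MultiHeadAttention(nn.Module):
    """Minimal MHA sketch: every head gets its own slice of the Q, K and V projections."""
    def __init__(self, d_model=512, num_heads=8):
        super().__init__()
        self.num_heads = num_heads
        self.d_head = d_model // num_heads
        # One d_model x d_model matrix per Q/K/V is equivalent to
        # 8 separate (d_model x d_head) matrices each: 8 x 3 = 24 projections in total.
        self.w_q = nn.Linear(d_model, d_model, bias=False)
        self.w_k = nn.Linear(d_model, d_model, bias=False)
        self.w_v = nn.Linear(d_model, d_model, bias=False)
        self.w_o = nn.Linear(d_model, d_model, bias=False)

    def forward(self, x):
        b, t, _ = x.shape
        # Project, then split into 8 per-head replicas: (batch, heads, tokens, d_head)
        q = self.w_q(x).view(b, t, self.num_heads, self.d_head).transpose(1, 2)
        k = self.w_k(x).view(b, t, self.num_heads, self.d_head).transpose(1, 2)
        v = self.w_v(x).view(b, t, self.num_heads, self.d_head).transpose(1, 2)
        attn = torch.softmax(q @ k.transpose(-2, -1) / self.d_head ** 0.5, dim=-1)
        out = (attn @ v).transpose(1, 2).reshape(b, t, -1)
        return self.w_o(out)
```

During incremental decoding, all 8 key replicas and all 8 value replicas of every previous token have to be cached and reloaded, which is exactly the memory-bandwidth problem the next sections address.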
Please refer to my transformer clear explanation blog for more details about Multi-Head Attention in the transformer architecture.
Multi-Query Attention
The paper “Fast Transformer Decoding: One Write-Head is All You Need — 2019 Google” pointed out that one major challenge with the Transformer is the speed of incremental inference, which is limited by the memory bandwidth needed to reload the large “keys” and “values” tensors that encode the state of the attention layers. The paper introduced Multi-Query Attention, which lets all heads share the same keys and values. For the 8-head attention case above, the keys have only 1 replica and the values have only 1 replica, while the queries still have 8 projected replicas. After adopting Multi-Query Attention, the decoder’s incremental inference speed increased by 13.9 times (the baseline has 8 heads).
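Sticking with the same toy dimensions, here is a minimal sketch of the MQA variant (again my own illustration, not the paper’s code): the queries keep 8 heads, while the key and value projections produce a single head that all query heads share, so only one key replica and one value replica per token need to sit in the KV cache.

```python
import torch
import torch.nn as nn

class MultiQueryAttention(nn.Module):
    """Minimal MQA sketch: 8 query heads share a single key/value head."""
    def __init__(self, d_model=512, num_heads=8):
        super().__init__()
        self.num_heads = num_heads
        self.d_head = d_model // num_heads
        self.w_q = nn.Linear(d_model, d_model, bias=False)       # 8 query heads
        self.w_k = nn.Linear(d_model, self.d_head, bias=False)   # 1 shared key head
        self.w_v = nn.Linear(d_model, self.d_head, bias=False)   # 1 shared value head
        self.w_o = nn.Linear(d_model, d_model, bias=False)

    def forward(self, x):
        b, t, _ = x.shape
        q = self.w_q(x).view(b, t, self.num_heads, self.d_head).transpose(1, 2)
        # Single K/V head, broadcast across the 8 query heads: (batch, 1, tokens, d_head)
        k = self.w_k(x).unsqueeze(1)
        v = self.w_v(x).unsqueeze(1)
        attn = torch.softmax(q @ k.transpose(-2, -1) / self.d_head ** 0.5, dim=-1)
        out = (attn @ v).transpose(1, 2).reshape(b, t, -1)
        return self.w_o(out)
```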
Grouped-Query Attention
Although Multi-Query Attention can significantly improve the incremental inference speed of the transformer, “GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints — Dec 2023 Google” mentioned that Multi-Query Attention can lead to quality degradation and training instability. To overcome this quality sacrifice, the paper introduced Grouped-Query Attention.
Instead of sharing the keys and values across all heads, a group of heads shares one projection matrix for the keys and values. The paper tested MQA and GQA on the T5 (Text-to-Text Transfer Transformer) model: T5-Large has 16 attention heads and T5-XXL has 64 attention heads. The paper chose 8 groups for the keys/values (the heads within a group share the same key/value projection matrices). It showed that the T5 model with Grouped-Query Attention achieves 5–6x inference speed and similar quality (from the following diagram, we can still see a slight quality degradation, even though it is very tiny) compared to the T5 model with traditional Multi-Head Attention.
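Here is a minimal sketch of the grouped variant (my own illustration, using 16 query heads and 8 key/value groups to mirror the T5-Large setting mentioned above): one key/value head is projected per group and reused by the query heads inside that group.

```python
import torch
import torch.nn as nn

class GroupedQueryAttention(nn.Module):
    """Minimal GQA sketch: num_heads query heads share num_groups key/value heads."""
    def __init__(self, d_model=1024, num_heads=16, num_groups=8):
        super().__init__()
        assert num_heads % num_groups == 0
        self.num_heads, self.num_groups = num_heads, num_groups
        self.d_head = d_model // num_heads
        self.w_q = nn.Linear(d_model, d_model, bias=False)
        self.w_k = nn.Linear(d_model, num_groups * self.d_head, bias=False)
        self.w_v = nn.Linear(d_model, num_groups * self.d_head, bias=False)
        self.w_o = nn.Linear(d_model, d_model, bias=False)

    def forward(self, x):
        b, t, _ = x.shape
        q = self.w_q(x).view(b, t, self.num_heads, self.d_head).transpose(1, 2)
        k = self.w_k(x).view(b, t, self.num_groups, self.d_head).transpose(1, 2)
        v = self.w_v(x).view(b, t, self.num_groups, self.d_head).transpose(1, 2)
        # Each cached K/V head serves num_heads // num_groups query heads.
        repeat = self.num_heads // self.num_groups
        k = k.repeat_interleave(repeat, dim=1)
        v = v.repeat_interleave(repeat, dim=1)
        attn = torch.softmax(q @ k.transpose(-2, -1) / self.d_head ** 0.5, dim=-1)
        out = (attn @ v).transpose(1, 2).reshape(b, t, -1)
        return self.w_o(out)
```

Setting num_groups to num_heads recovers plain MHA, and setting it to 1 recovers MQA, which is why GQA is usually described as the middle point between the two.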
Multi-Head Latent Attention (MLA)
The paper “DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model” introduced Multi-Head Latent Attention (MLA) for the attention module: “The MLA utilizes low-rank key-value joint compression to eliminate the bottleneck of inference-time key-value cache, thus supporting efficient inference. The core of MLA is the low-rank joint compression for keys and values to reduce KV cache.”
Low-rank compression is a technique used to reduce the computational and memory requirements of deep learning models (see the toy sketch after the list below). There are a lot of papers about it, such as:
- Low-rank Compression of Neural Nets: Learning the Rank of Each Layer (UC Merced)
- Compressing Large Language Models using Low Rank and Low Precision Decomposition (Stanford)
- LORC: Low-Rank Compression for LLMs KV Cache with a Progressive Compression Strategy (Gatech & Microsoft)
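As a toy illustration of the general idea (not tied to any specific paper above, and with made-up dimensions), a large weight matrix can be approximated by the product of two thin matrices, which cuts both the parameter count and the memory traffic when the rank is small:

```python
import torch

d_in, d_out, rank = 4096, 4096, 256

w_full = torch.randn(d_out, d_in)                           # 16,777,216 parameters
a, b = torch.randn(d_out, rank), torch.randn(rank, d_in)    # 2,097,152 parameters combined

x = torch.randn(d_in)
y_full = w_full @ x      # full-rank projection
y_low  = a @ (b @ x)     # low-rank approximation: first compress to rank 256, then expand
print(w_full.numel(), a.numel() + b.numel())                # 16777216 vs 2097152
```

MLA applies the same trick to the keys and values jointly: instead of caching the full per-head keys and values, it caches only the small compressed (latent) vector and reconstructs keys and values from it with up-projection matrices.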
DeepSeek V1 adopted Rotary Position Embedding (RoPE) to carry the tokens’ positional information (the original Transformer adopted sinusoidal positional encoding; please refer to my Transformer Clear Explanation blog here for the details). Because low-rank KV compression is incompatible with RoPE, DeepSeek V2 uses a decoupled RoPE strategy that employs additional multi-head queries and a shared key to carry the RoPE information.
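Putting the two ideas together, here is a heavily simplified sketch of how an MLA-style layer can be wired up. All dimensions and names are my own illustration; the real DeepSeek-V2 layer also low-rank compresses the queries and uses much larger dimensions, and the RoPE helper below is a minimal rotate-half implementation. Treat this as a conceptual sketch, not the actual DeepSeek code.

```python
import torch
import torch.nn as nn

def rope(x, base=10000.0):
    """Minimal rotary position embedding for x of shape (batch, heads, tokens, dim)."""
    b, h, t, d = x.shape
    pos = torch.arange(t, dtype=x.dtype, device=x.device)
    freqs = base ** (-torch.arange(0, d, 2, dtype=x.dtype, device=x.device) / d)
    angles = pos[:, None] * freqs[None, :]                    # (tokens, dim/2)
    cos = torch.cos(angles).repeat_interleave(2, dim=-1)      # (tokens, dim)
    sin = torch.sin(angles).repeat_interleave(2, dim=-1)
    x1, x2 = x[..., 0::2], x[..., 1::2]
    rotated = torch.stack((-x2, x1), dim=-1).flatten(-2)      # rotate each adjacent pair
    return x * cos + rotated * sin

class MultiHeadLatentAttention(nn.Module):
    """Simplified MLA sketch: keys/values are rebuilt from a small shared latent vector,
    and a separate small RoPE key is shared across heads (decoupled RoPE)."""
    def __init__(self, d_model=512, num_heads=8, d_latent=128, d_rope=32):
        super().__init__()
        self.num_heads = num_heads
        self.d_head = d_model // num_heads
        self.d_rope = d_rope
        self.w_q      = nn.Linear(d_model, d_model, bias=False)            # content queries
        self.w_q_rope = nn.Linear(d_model, num_heads * d_rope, bias=False) # per-head RoPE queries
        self.w_dkv    = nn.Linear(d_model, d_latent, bias=False)           # down-projection (cached)
        self.w_uk     = nn.Linear(d_latent, d_model, bias=False)           # up-projection for keys
        self.w_uv     = nn.Linear(d_latent, d_model, bias=False)           # up-projection for values
        self.w_k_rope = nn.Linear(d_model, d_rope, bias=False)             # shared RoPE key (cached)
        self.w_o      = nn.Linear(d_model, d_model, bias=False)

    def forward(self, x):
        b, t, _ = x.shape
        def split(z, d):
            return z.view(b, t, self.num_heads, d).transpose(1, 2)
        # Only c_kv (d_latent per token) and k_r (d_rope per token) would be cached.
        c_kv = self.w_dkv(x)                                   # (b, t, d_latent)
        k_c = split(self.w_uk(c_kv), self.d_head)              # reconstructed per-head keys
        v   = split(self.w_uv(c_kv), self.d_head)              # reconstructed per-head values
        k_r = rope(self.w_k_rope(x).unsqueeze(1))              # (b, 1, t, d_rope), shared by all heads
        q_c = split(self.w_q(x), self.d_head)
        q_r = rope(split(self.w_q_rope(x), self.d_rope))       # per-head RoPE queries
        k = torch.cat([k_c, k_r.expand(-1, self.num_heads, -1, -1)], dim=-1)
        q = torch.cat([q_c, q_r], dim=-1)
        attn = torch.softmax(q @ k.transpose(-2, -1) / (self.d_head + self.d_rope) ** 0.5, dim=-1)
        out = (attn @ v).transpose(1, 2).reshape(b, t, -1)
        return self.w_o(out)
```

The key point is what lands in the cache: per token, only the latent vector and the small shared RoPE key are stored, while the positional information that RoPE must attach to the keys is carried entirely by the decoupled k_r/q_r path, so it never has to be baked into the compressed latent.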
The paper reports that MLA shows better performance than MHA, while requiring a significantly smaller KV cache than MHA (14% of MHA’s cache for small MoE models and 4% for large MoE models).
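As a rough sanity check of why the saving is so large, here is a back-of-the-envelope count of cached elements per token per layer for the four variants, using the toy dimensions from the sketches above (the exact ratios are illustrative only and are not meant to reproduce the paper’s 14%/4% figures):

```python
# Cached elements per token per layer, with the toy dimensions from the sketches above.
num_heads, d_head = 8, 64        # d_model = 512
num_groups = 4                   # illustrative GQA grouping for this 8-head setting
d_latent, d_rope = 128, 32       # illustrative MLA latent / RoPE dimensions

mha = 2 * num_heads * d_head     # full per-head keys + values: 1024
mqa = 2 * d_head                 # one shared key/value head:    128
gqa = 2 * num_groups * d_head    # one key/value head per group: 512
mla = d_latent + d_rope          # compressed latent + shared RoPE key: 160

print(mha, mqa, gqa, mla)        # 1024 128 512 160
```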
My Comments
It surprised me that the quality of MLA surpasses MHA in the above test report; I thought MLA would have similar or slightly worse quality compared with MHA, because in MHA each head has its own projected keys, values and queries, which makes it easy for the attention module to learn to attend to different positions. Is it possible that this is because the “decoupled RoPE” uses additional multi-head queries and a shared key to carry RoPE? Emm, I have no answer yet.