DeepSeek-R1 Technical Analysis: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning
In my previous five blogs, I explained five key techniques the DeepSeek models use to reduce training cost and improve model accuracy:
- Mixture-of-Experts, which reduced the training cost and improved inference efficiency.
- Multi-Head Latent Attention, which reduced the KV cache for the attention part.
- Multi-Token Prediction, which improved the performance (accuracy) of the model.
- DualPipe, which improved the computation-to-communication ratio and the efficiency of large-scale GPU clusters.
- FP8 Training, which further reduced the training cost through low-precision training.
In this blog, I’ll focus on the special techniques the DeepSeek-R1 model uses to improve its capability, especially the SFT-free (“pure”) reinforcement learning that significantly improved its reasoning capability.
DeepSeek-R1-Zero: Emerging Remarkable Reasoning Capabilities through Pure Reinforcement Learning
DeepSeek-R1-Zero is produced by taking DeepSeek-V3-Base as the base model and employing GRPO (Group Relative Policy Optimization) as the reinforcement learning framework to improve the model’s reasoning performance. After thousands of RL steps, DeepSeek-R1-Zero exhibits strong performance on reasoning benchmarks.
The DeepSeek-R1-Zero study demonstrated that reasoning capabilities can be significantly improved through large-scale reinforcement learning (RL), even without using supervised fine-tuning (SFT) as a cold start.
To save RL training costs, GRPO is adopted during the RL training. GRPO is a variant of PPO (Proximal Policy Optimization): it eliminates PPO’s value model and instead estimates the baseline from group scores, which significantly reduces the training resources required compared to PPO. I will probably write another blog to compare PPO and GRPO in detail.
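To make the group-relative baseline idea concrete, here is a minimal sketch in Python (an illustration, not DeepSeek’s implementation): for each prompt, a group of responses is sampled and scored, and each response’s advantage is its reward normalized against the group’s mean and standard deviation, so no value model is needed. The full GRPO objective additionally uses the PPO-style clipped probability ratio and a KL penalty against a reference model, which are omitted here.

```python
import numpy as np

def group_relative_advantages(group_rewards):
    """Advantages for one group of sampled responses to the same prompt.

    The group mean plays the role of the baseline that PPO would obtain from a
    learned value model; the standard deviation rescales the training signal.
    """
    rewards = np.asarray(group_rewards, dtype=np.float64)
    return (rewards - rewards.mean()) / (rewards.std() + 1e-8)  # epsilon avoids division by zero

# Example: four sampled answers to one math prompt, scored 1 (correct) or 0 (wrong).
print(group_relative_advantages([1.0, 0.0, 1.0, 0.0]))  # -> approximately [ 1., -1.,  1., -1.]
```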
The reward is the source of the training signal, which decides the optimization direction of RL. To train DeepSeek-R1-Zero, the authors adopted a rule-based reward system that mainly consists of two types of rewards: accuracy rewards and format rewards. The accuracy reward evaluates whether the response is correct: for math problems with deterministic results, the model is required to provide the final answer in a specified format so it can be checked automatically, and for LeetCode problems, a compiler can be used to generate feedback based on predefined test cases. The format reward enforces that the model puts its thinking process between ‘<think>’ and ‘</think>’ tags.
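Here is a minimal sketch of what such rule-based rewards could look like. The ‘<answer>’ tag, the exact regular expressions, and the 0/1 scoring are assumptions made for this illustration, not the DeepSeek team’s actual reward code.

```python
import re

# A well-formed response: reasoning inside <think> tags, then the final answer inside <answer> tags.
FORMAT_PATTERN = re.compile(r"<think>.*?</think>\s*<answer>.*?</answer>", re.DOTALL)

def format_reward(response: str) -> float:
    """1.0 if the response follows the required thinking/answer template, else 0.0."""
    return 1.0 if FORMAT_PATTERN.fullmatch(response.strip()) else 0.0

def accuracy_reward(response: str, reference_answer: str) -> float:
    """1.0 if the extracted final answer matches the reference exactly, else 0.0."""
    match = re.search(r"<answer>(.*?)</answer>", response, re.DOTALL)
    if match is None:
        return 0.0
    return 1.0 if match.group(1).strip() == reference_answer.strip() else 0.0

def total_reward(response: str, reference_answer: str) -> float:
    return accuracy_reward(response, reference_answer) + format_reward(response)

# Example
response = "<think>2 + 2 equals 4 because ...</think> <answer>4</answer>"
print(total_reward(response, "4"))  # -> 2.0
```

For code problems, accuracy_reward would instead compile the generated program and run it against predefined test cases.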
The figure below shows the performance trajectory of DeepSeek-R1-Zero on the AIME 2024 benchmark throughout the RL training process. As illustrated, DeepSeek-R1-Zero demonstrates a steady and consistent improvement in performance as the RL training advances.
The findings reveal that RL empowers DeepSeek-R1-Zero to attain robust reasoning capabilities without the need for any supervised fine-tuning data. This is a noteworthy achievement, as it underscores the model’s ability to learn and generalize effectively through RL alone.
Aha Moment of DeepSeek-R1-Zero
A particularly intriguing phenomenon observed during the training of DeepSeek-R1-Zero is the occurrence of an “aha moment”. This moment, as illustrated in the following diagram, occurs in an intermediate version of the model. During this phase, DeepSeek-R1-Zero learns to allocate more thinking time to a problem by reevaluating its initial approach. This behavior is not only a testament to the model’s growing reasoning abilities but also a captivating example of how reinforcement learning can lead to unexpected and sophisticated outcomes.
This moment is not only an “aha moment” for the model but also for the researchers observing its behavior. It underscores the power and beauty of reinforcement learning: rather than explicitly teaching the model how to solve a problem, we simply provide it with the right incentives, and it autonomously develops advanced problem-solving strategies. The “aha moment” serves as a powerful reminder of the potential of RL to unlock new levels of intelligence in artificial systems, paving the way for more autonomous and adaptive models in the future.
DeepSeek-R1: Reinforcement Learning with Cold Start
Although DeepSeek-R1-Zero exhibits strong reasoning capabilities and autonomously develops unexpected and powerful reasoning behaviors, it faces several issues. For instance, DeepSeek-R1-Zero struggles with poor readability and language mixing. To make reasoning processes more readable and share them with the open community, the DeepSeek team explored DeepSeek-R1, a method that utilizes RL together with human-friendly cold-start data.
To avoid the unstable early cold-start phase of RL training from the base model, the DeepSeek team 1) constructed and collected a small amount of long chain-of-thought (CoT) cold-start data, thousands of samples, and used it to fine-tune DeepSeek-V3-Base as the initial RL actor and starting point for RL.
After fine-tuning DeepSeek-V3-Base on the cold-start data, the DeepSeek team 2) applied the same large-scale reinforcement learning training process as employed for DeepSeek-R1-Zero.
When the reasoning-oriented RL converges, the DeepSeek team 3) utilized the resulting checkpoint to collect SFT (supervised fine-tuning) data for the subsequent round. Unlike the initial cold-start data, which primarily focuses on reasoning, this stage incorporates data from other domains to enhance the model’s capabilities in writing, role-playing, and other general-purpose tasks.
To further align the model with human preferences, the DeepSeek team 4) implemented a secondary reinforcement learning stage aimed at improving the model’s helpfulness and harmlessness while simultaneously refining its reasoning capabilities. A schematic sketch of the full four-stage pipeline follows below.
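The sketch below only shows the ordering of the four stages; sft, grpo_rl, and collect_sft_from_checkpoint are placeholder stubs standing in for full training runs, and the data arguments are illustrative.

```python
from typing import Callable, List

# Placeholder stubs: each would be a full SFT or GRPO training run in reality.
def sft(model: str, dataset: List[str]) -> str:
    return f"{model} -> SFT({len(dataset)} samples)"

def grpo_rl(model: str, prompts: List[str], reward_fn: Callable[[str, str], float]) -> str:
    return f"{model} -> GRPO-RL({len(prompts)} prompts)"

def collect_sft_from_checkpoint(model: str, prompts: List[str]) -> List[str]:
    # Rejection sampling (stub): generate responses with the RL checkpoint and keep the good ones.
    return [f"curated response for: {p}" for p in prompts]

def train_r1(base_model: str,
             cold_start_data: List[str],
             reasoning_prompts: List[str],
             general_sft_data: List[str],
             all_scenario_prompts: List[str],
             reward_fn: Callable[[str, str], float]) -> str:
    # 1) Cold start: fine-tune the base model on a small amount of long-CoT data.
    actor = sft(base_model, cold_start_data)
    # 2) Reasoning-oriented RL with rule-based rewards, as in DeepSeek-R1-Zero.
    actor = grpo_rl(actor, reasoning_prompts, reward_fn)
    # 3) Collect SFT data from the converged checkpoint, mix in general-domain data,
    #    and run a new round of SFT (the paper restarts this SFT from the base model).
    sft_data = collect_sft_from_checkpoint(actor, reasoning_prompts) + general_sft_data
    actor = sft(base_model, sft_data)
    # 4) A second RL stage over all scenarios for helpfulness, harmlessness,
    #    and further refined reasoning.
    return grpo_rl(actor, all_scenario_prompts, reward_fn)

print(train_r1("DeepSeek-V3-Base", ["cot_example"], ["math_prompt"],
               ["writing_example"], ["chat_prompt"], lambda resp, ref: 1.0))
```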
Distillation: Empowering Small Models with Reasoning Capability
To equip smaller, more efficient models with reasoning capabilities like DeepSeek-R1’s, the DeepSeek team directly fine-tuned open-source models such as Qwen (Qwen, 2024b) and Llama (AI@Meta, 2024) using the 800k samples curated with DeepSeek-R1. Their findings indicate that this straightforward distillation method significantly enhances the reasoning abilities of smaller models.
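Since this distillation is plain supervised fine-tuning on the teacher’s outputs, a minimal sketch looks like ordinary causal-LM SFT on reasoning traces. The student checkpoint name, the sample format, and the hyperparameters below are illustrative assumptions, not the DeepSeek team’s actual setup.

```python
import torch
from torch.utils.data import DataLoader
from transformers import AutoModelForCausalLM, AutoTokenizer

student_name = "Qwen/Qwen2.5-1.5B"  # assumed student model, for illustration only
tokenizer = AutoTokenizer.from_pretrained(student_name)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(student_name)

# Each sample is a prompt followed by the teacher's (DeepSeek-R1's) full
# chain-of-thought response; in practice there are roughly 800k curated samples.
samples = [
    "Question: What is 12 * 7?\n<think>12 * 7 = 84.</think>\n<answer>84</answer>",
]

def collate(batch):
    enc = tokenizer(batch, return_tensors="pt", padding=True, truncation=True, max_length=1024)
    # Standard next-token prediction on the teacher traces; ignore loss on padding.
    enc["labels"] = enc["input_ids"].masked_fill(enc["attention_mask"] == 0, -100)
    return enc

loader = DataLoader(samples, batch_size=1, shuffle=True, collate_fn=collate)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

model.train()
for batch in loader:
    loss = model(**batch).loss  # cross-entropy over the teacher's tokens
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```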
My Comments
DeepSeek-V2 and V3 proposed and practiced many optimizations and innovations, such as fine-grained Mixture-of-Experts, Multi-Head Latent Attention, Multi-Token Prediction, DualPipe, and FP8 low-precision training, which significantly reduced the training cost and improved model accuracy.
Meanwhile, DeepSeek-R1 is notably the first open research to validate that the reasoning capabilities of LLMs can be incentivized purely through RL, without the need for SFT. This breakthrough paves the way for future advancements in this area.
From the training dataset’s perspective: garbage in, garbage out. A high-quality training dataset is crucial for training a high-quality model, but obtaining one is costly. Scale AI, for example, is a company that provides high-quality training datasets to the AI giants, and as far as I know it still relies on a large amount of human labor to generate annotated training data. How to generate high-quality training data effectively and efficiently has always been a valuable research topic in AI. During the training of DeepSeek-R1, the DeepSeek team also mentioned that they expanded their fine-tuning dataset using the RL model itself.